An overview of available speech technology resources developed by the Human Language Technology Research Group of the CSIR Meraka Institute.
Human language technology development is a data-intensive endeavour. Since 2003, the Human Language Technology Research Group (HLTRG) at the CSIR Meraka Institute has been steadily contributing to the development of the language resources required for speech technology development. This work has had a significant impact in South Africa because the official languages, apart from English, are resource-scarce: limited electronic text and/or audio corpora are available for these languages, let alone parallel corpora. Nevertheless, through projects such as the National Centre for Human Language Technology (NCHLT) Speech and Text projects, funded by the Department of Arts and Culture (DAC), and initiatives such as the Resource Management Agency, good progress has been made in developing the resources required for language technology development in our languages. This paper provides an overview of the HLTRG’s resource development work to date.
The paper begins by giving an overview of speech technology and contextualising the need for language resources for this type of work. While companies such as Google have developed systems such as Google Voice Search to assist with the collection of the resources required for language technology development, such systems do not yet exist in South Africa. The HLT audit undertaken for the National HLT Network (NHN) will be mentioned briefly to highlight the availability of resources at the time of the audit.
The paper will then briefly describe the speech technology work done by the HLTRG, to contextualise the resource development efforts undertaken to date. The HLTRG studies the way in which speech- and language-related technologies can be created and applied to benefit the people of southern Africa. Over the past 10 years, the HLTRG has been developing speech resources and technologies, namely automatic speech recognition (ASR) and text-to-speech (TTS) technologies, for South Africa’s resource-scarce environment. The HLTRG at the CSIR Meraka Institute has been working on:
1. Text-to-speech (TTS) technology development for the 11 official languages of South Africa, where current research focuses on speech prosody, code-switching and proper name pronunciation in TTS, and on maturing the TTS technology into operational systems.
2. Automatic speech recognition (ASR) technology development for the 11 official languages of South Africa, where current research focuses on improving in-domain ASR performance and developing human capital in ASR research and development.
3. Resource development for the 11 official languages of South Africa:
a. The efforts began between 2006 and 2009 with annotated speech corpora for the 11 languages, with close to 6 hours of telephone speech per language in the ASR corpora and almost 1 hour of speech data per language in the TTS corpora.
b. By the end of 2012, around 60 hours of broadband speech had been collected for each of the 11 official languages in the National Centre for Human Language Technology (NCHLT) Speech Resource Development project.
c. From 2013 to date, the focus has been on extending the NCHLT I speech corpora by automatically enhancing the annotation level of existing multilingual data sets.
The paper will provide details on the types of corpora available, descriptions of these, their terms of use, and how to access them, in as much detail as is possible within the time constraint. The paper will conclude with observations on possible future speech resource development work to be undertaken, based on needs that have been identified by the HLTRG in conjunction with industry partners and government departments. Some challenges in this regard may also be touched on briefly.