Digital text data and automatic text processing tools for South African languages.
Dr. Roald Eiselen
Manager: Resource Development
Centre for Text Technology
North-West University
The digital analysis of textual information is a well-established discipline embodied in fields such as corpus linguistics and various domains of computational linguistics. The aim of these approaches is to uncover linguistic patterns both in the surface text, such as collocations and differences in sentence structure across domains, and in the underlying linguistic structure inherent to all language. Computational linguistics focuses on the automatic processing of linguistic data sets in order to identify generalisations that can be applied to the infinite set of possible language expressions and allow computers to “understand” human language better.
Although the field of automatic text processing has a relatively long history, especially for European languages, South Africa is in a unique position because of several factors. Firstly, South Africa has eleven official languages, which differ in the availability of data, resources and processing tools; in general, all of these languages apart from English have a limited number of digital resources available. This resource scarcity forces researchers in the field to investigate novel methodologies and implementations that aim to maximise and reuse the available resources. Secondly, many of the South African languages have orthographies and morphosyntactic structures that are relatively far removed from their European counterparts, which means that different approaches and technologies are required for accurate linguistic and computational modelling. Lastly, the scarcity of digital resources provides the digital humanities community with a unique opportunity to develop new resources that will have a marked impact on the South African research community.
Although there is a substantial shortage of available digital resources, there have been various efforts, both nationally and internationally, to create initial data sets for most of the South African languages. Over the last decade the National Centre for Human Language Technology (NCHLT) has funded various resource and technology development efforts in support of the language technology research and development community. As part of this effort, the NCHLT established the Language Resource Management Agency (RMA), an online distribution centre for language resources. Currently the RMA hosts more than 200 language resources, including corpora, annotated data sets for text and speech, and automatic processing tools for the eleven official languages of South Africa. Most of the resources available via the RMA are distributed under open licences that promote the use of the data for research and development in all aspects of language processing.
Aims
The workshop aims to enable researchers to develop compliant, reusable digital resources and use available automatic text processing tools in various fields of digital humanities where text processing is essential. The workshop specifically aims to:
- provide participants with background and best practices in the development, management and distribution of data resources, including corpora and annotated data sets;
- introduce participants to data sets that are freely available for linguistic research in the South African context, including data sets in all eleven official languages of South Africa;
- introduce participants to various tools that are available for the automatic processing of textual data, including lemmatisers, morphological analysers, part-of-speech taggers, named entity recognisers, phrase chunkers, and syntactic parsers;
- train participants in the use of the various tools available for automatic text processing;
- train participants in basic analysis methodologies used in the study of digital text data.
Location requirements
The workshop will require the use of a workstation.
No internet connectivity is required.
Software requirements
Windows 7/8/10
No additional software is required: all required software will be made available to attendees as standalone applications that do not require the installation of any additional software.