DHASA 2017

African Wordnets: first steps towards use in the Digital Humanities

Bosch, Sonja E; Griesel, Marissa

University of South Africa

The African Wordnet Project (AWN) aims to build wordnets for five African languages, namely Setswana, isiXhosa, isiZulu, Sesotho sa Leboa (Sepedi) and Tshivenda (see Griesel & Bosch, 2014, for a detailed introduction). The African wordnets are currently being developed manually by means of the expand model (Vossen, 1998) and are based on the English Princeton WordNet (PWN) (Fellbaum, 1998). A quality assurance phase formed part of the development of the roughly 55,000 synsets that are now included in the AWN.

Developing rich and hierarchically structured wordnets such as these for under-resourced languages is labour-intensive and costly. Apart from the various linguistic and lexicographic considerations that have to be made when developing a single entry in the wordnet, expert linguists also have to keep the interconnectivity of the meanings in mind. In the structure of the database, synsets are linked across the different languages and to the English PWN, and synonyms are additionally grouped together in different domains. This makes a wordnet in general, and the AWN in particular, useful for semantic analysis of texts in the African languages.
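The linking described above can be illustrated with a minimal Python sketch. It is not the AWN database format: the `Synset` fields, the identifiers (`zul-0001`, `pwn-dog-n`) and the domain label are hypothetical placeholders, and the three entries are toy data chosen only to show how a shared PWN link lets one move between languages.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    synset_id: str   # hypothetical local identifier, not a real AWN id
    language: str    # ISO 639-3 code, e.g. "zul" (isiZulu), "tsn" (Setswana)
    lemmas: tuple    # synonyms grouped together in this synset
    pwn_id: str      # placeholder for the linked English PWN synset
    domain: str      # illustrative domain label


# Toy entries for illustration only.
SYNSETS = [
    Synset("zul-0001", "zul", ("inja",), "pwn-dog-n", "animals"),
    Synset("tsn-0001", "tsn", ("ntša",), "pwn-dog-n", "animals"),
    Synset("eng-0001", "eng", ("dog", "domestic dog"), "pwn-dog-n", "animals"),
]


def index_by_pwn(synsets):
    """Group synsets from all languages by the PWN synset they link to."""
    index = defaultdict(list)
    for synset in synsets:
        index[synset.pwn_id].append(synset)
    return index


def equivalents(lemma, language, synsets):
    """Return lemmas in other languages that share a PWN link with `lemma`."""
    index = index_by_pwn(synsets)
    results = {}
    for synset in synsets:
        if synset.language == language and lemma in synset.lemmas:
            for other in index[synset.pwn_id]:
                if other.language != language:
                    results[other.language] = other.lemmas
    return results


print(equivalents("inja", "zul", SYNSETS))
```

Because every synset points to the same pivot identifier, no direct pairwise links between the African languages are needed in this sketch; this mirrors the role the PWN plays in the expand model, although the actual AWN storage may differ.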

Any kind of annotation that adds information about the meaning of constituents in a text is regarded as semantic annotation. Semantically annotated or tagged corpora are considered one of the building blocks for natural language processing applications as well as investigations in the Digital Humanities (cf. Ide, 2004). These corpora are collections of texts that have been enriched with meanings in the form of synonyms for the key words. A definition, usage example and domain might accompany the tag. The resulting corpus can then be used in advanced (computerised) analysis of a text, machine translation or information retrieval. It is well known that in languages such as English, wordnets have for many years served as a basis for sense tagging, one of the main reasons being that the Princeton WordNet, for instance, is a freely available, online, machine-tractable lexicon providing extensive coverage of English.
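As a rough illustration of what such a tag might carry, the following Python sketch attaches a synset identifier, gloss and domain to each token found in a small lexicon. The lexicon entry, its field names and the exact-match lookup are all assumptions made for the example; for morphologically rich languages such as isiZulu, a real tagger would need lemmatisation or morphological analysis before lookup.

```python
import re

# Hypothetical toy lexicon: surface form -> semantic tag.
# Field names and the synset identifier are illustrative, not AWN data.
LEXICON = {
    "inja": {"synset": "zul-0001", "gloss": "dog", "domain": "animals"},
}


def tag_tokens(text, lexicon):
    """Pair each lower-cased token with its tag, or None if it is not covered."""
    tokens = re.findall(r"\w+", text.lower())
    return [(token, lexicon.get(token)) for token in tokens]


for token, tag in tag_tokens("Inja iyagijima", LEXICON):
    print(token, tag)
```

Tokens without an entry are kept with a `None` tag rather than dropped, so coverage gaps in the lexicon remain visible in the annotated output.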

At the moment no such resource exists for any of the African languages of South Africa, and the development of the AWN can therefore be considered a first step towards the semantic analysis of texts and the subsequent creation of semantically tagged corpora. In this presentation we will show a novel application of wordnets in the Digital Humanities for the South African environment, namely the creation of semantically tagged resources. The unique advantages of using a multilingual wordnet such as the AWN to gain semantic access to texts as an L2 reader will also be illustrated. We will conclude with recommendations for future development.

References:

Fellbaum, C. (ed.) 1998. WordNet: An electronic lexical database. The MIT Press, Cambridge, Mass. ISBN 978-0-262-06197-1.

Griesel, M. & Bosch, S. 2014. Taking stock of the African Wordnet Project: 5 years of development. In Orav, H., Fellbaum, C. & Vossen, P. (eds): Proceedings of the 7th Global WordNet Conference 2014 (GWC2014), pp. 148-153, Tartu, Estonia, ISBN 978-9949-32-492-7.

Ide, N. 2004. Preparation and Analysis of Linguistic Corpora. In Schreibman, S., Siemens, R. & Unsworth, J. (eds): A companion to Digital Humanities. Chapter 21. Oxford: Blackwell. [online] http://www.digitalhumanities.org/companion/. Accessed on 21 October 2016.

Vossen, P. 1998. EuroWordNet: A multilingual database with lexical semantic networks. Kluwer Academic, Dordrecht. ISBN 0-7923-5295-5

Digital Humanities Association of Southern Africa