The Effects of a Corpus on isiZulu Spellcheckers based on N-grams

Ndaba, Balone and Suleman, Hussein and Keet, C. Maria and Khumalo, Langa (2016) The Effects of a Corpus on isiZulu Spellcheckers based on N-grams, Proceedings of IST-Africa 2016, May 11-13, 2016, Durban, South Africa, IIMC International Information Management Corporation.

Abstract

Correct spelling contributes to the accessibility and readability of textual documents. However, there are few spellcheckers for Bantu languages such as isiZulu, the most widely spoken home language in South Africa. The objective of this research is to investigate the development of spellcheckers for isiZulu and, more generally, an approach that can be reused across Bantu languages. To fill this gap in an extensible way, we used data-driven statistical language models with trigrams and quadrigrams. The models were trained on three different isiZulu corpora: Ukwabelana, a selection of the isiZulu National Corpus, and a small corpus of news items. The system performed better with trigrams than with quadrigrams, and performance depended on the training and testing corpora. When the system was trained on old text (the Bible in isiZulu), it did not perform well when tested on the two corpora that contain more recent texts, such as the constitution and news items. The highest accuracy obtained was 89%. Given that data-driven statistical language models constitute a language-independent approach, we conclude that data-driven spellcheckers for all Bantu languages are indeed feasible. They are, however, sensitive to the training and testing data. The data-driven approach is less resource-intensive than manual specification of rules, and spellcheckers for Bantu languages are therefore now practically within reach. The potential societal impact of spellchecker-supported tools and apps is incalculable.
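
To illustrate the general idea of n-gram-based error detection described in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes character-level n-grams, word-boundary markers, and a simple count threshold; the function names (`char_ngrams`, `train`, `is_probably_correct`), the `threshold` parameter, and the toy word list are all hypothetical.

```python
from collections import Counter


def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word.

    Assumption: '^' and '$' are added as word-boundary markers.
    """
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(corpus_words, n=3):
    """Count every character n-gram that occurs in the training words."""
    counts = Counter()
    for word in corpus_words:
        counts.update(char_ngrams(word, n))
    return counts


def is_probably_correct(word, counts, n=3, threshold=1):
    """Accept a word only if each of its n-grams occurs at least `threshold` times."""
    return all(counts[gram] >= threshold for gram in char_ngrams(word, n))


# Toy training data (hypothetical; a real model would be trained on a full corpus).
training_words = ["umuntu", "abantu", "isiZulu", "ukubona"]
trigram_model = train(training_words, n=3)       # trigrams
quadrigram_model = train(training_words, n=4)    # quadrigrams
```

With this sketch, `is_probably_correct("umuntu", trigram_model)` accepts a word seen in training, while a string containing unseen trigrams is flagged. A model trained only on `["umuntu"]` would also flag the valid word "abantu", which is a small-scale picture of the sensitivity to training data that the abstract reports.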

Item Type:	Conference paper
Uncontrolled Keywords:	spellchecker, n-grams, corpora, isiZulu
Subjects:	Applied computing > Document management and text processing
Alternate Locations:	http://www.meteck.org/files/afrispeIST16crc.pdf
Date Deposited:	18 Aug 2016
Last Modified:	10 Oct 2019 15:32
URI:	http://pubs.cs.uct.ac.za/id/eprint/1084
