UCT CS Research Document Archive

The Effects of a Corpus on isiZulu Spellcheckers based on N-grams

Ndaba, Balone, Hussein Suleman, C. Maria Keet and Langa Khumalo (2016) The Effects of a Corpus on isiZulu Spellcheckers based on N-grams. In Cunningham, Paul and Miriam Cunningham, Eds. Proceedings IST-Africa 2016, Durban, South Africa.

Correct spelling contributes to good content accessibility and readability for textual documents. However, there are few spellcheckers for Bantu languages such as isiZulu, the most widely spoken language in South Africa. The objective of this research is to investigate the development of spellcheckers for isiZulu and, more generally, an approach that can be reused across Bantu languages. To fill this gap in an extensible way, we used data-driven statistical language models with trigrams and quadrigrams. The models were trained on three different isiZulu corpora: Ukwabelana, a selection of the isiZulu National Corpus, and a small corpus of news items. The system performed better with trigrams than with quadrigrams, and performance depended on the training and testing corpora. When the system was trained on old text (the Bible in isiZulu), it did not perform well when tested on the two corpora that contain more recent texts, such as the Constitution and news items. The highest accuracy obtained was 89%. Given that data-driven statistical language models constitute a language-independent approach, we conclude that data-driven spellcheckers for all Bantu languages are indeed feasible. They are, however, sensitive to the training and testing data. The data-driven approach is less resource-intensive than manual specification of rules, and spellcheckers for Bantu languages are therefore now practically within reach. The potential societal impact of spellchecker-supported tools and apps is incalculable.
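As a rough illustration of the kind of data-driven n-gram model described in the abstract, the following Python sketch flags a word as a likely misspelling when it contains an n-gram that was not seen often enough in the training corpus. It rests on several assumptions that the abstract does not confirm: that the n-grams are character n-grams within a word, that a simple frequency threshold is used for the decision, and that accuracy is the fraction of words classified correctly. The function names (`char_ngrams`, `train`, `is_probably_correct`, `accuracy`), the `threshold` parameter, and the tiny word list are hypothetical and are not taken from the paper.

```python
from collections import Counter


def char_ngrams(word, n=3):
    """Return the overlapping character n-grams of a word (lower-cased)."""
    word = word.lower()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def train(corpus_words, n=3):
    """Count how often each character n-gram occurs in the training words."""
    counts = Counter()
    for word in corpus_words:
        counts.update(char_ngrams(word, n))
    return counts


def is_probably_correct(word, counts, n=3, threshold=1):
    """Accept a word only if every one of its n-grams was seen at least
    `threshold` times in training (assumed decision rule, not the paper's)."""
    grams = char_ngrams(word, n)
    if not grams:
        return True  # word shorter than n: nothing to judge
    return all(counts[g] >= threshold for g in grams)


def accuracy(labelled_words, counts, n=3, threshold=1):
    """Fraction of (word, is_correct) pairs that the model classifies correctly."""
    hits = sum(
        is_probably_correct(word, counts, n, threshold) == label
        for word, label in labelled_words
    )
    return hits / len(labelled_words)


# Toy training data (illustrative only; a real model needs a large corpus).
training_words = ["umuntu", "abantu", "isikole", "ukufunda"]
trigram_counts = train(training_words, n=3)
quadrigram_counts = train(training_words, n=4)

# One correctly spelled word and one with an unseen trigram ("umz").
test_set = [("umuntu", True), ("umzntu", False)]
trigram_accuracy = accuracy(test_set, trigram_counts, n=3)
```

Changing `n` from 3 to 4 switches the sketch from trigrams to quadrigrams. Because the model only accepts n-grams it has seen, words from a genre or period absent from the training corpus are more likely to be flagged, which is consistent with the sensitivity to training and testing data reported above.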

EPrint Type: Conference Paper
Keywords: spellchecker, n-grams, corpora, isiZulu
Subjects: I Computing Methodologies: I.7 DOCUMENT AND TEXT PROCESSING
ID Code: 1084
Deposited By: Keet, C. Maria
Deposited On: 18 August 2016
Alternative Locations: http://www.meteck.org/files/afrispeIST16crc.pdf