UCT CS Research Document Archive

AfriWeb: A Web Search Engine for a Marginalized Language

Malumba, Nkosana, Katlego Moukangwe and Hussein Suleman (2015) AfriWeb: A Web Search Engine for a Marginalized Language. In Allen, Robert B, Jane Hunter and Marcia L. Zeng, Eds. Proceedings 17th International Conference on Asia-Pacific Digital Libraries, pages 180-189, Seoul, South Korea.

Abstract

isiZulu is a Bantu language spoken by approximately 9 million people, yet very few written documents in it are available on the Internet. The lack of electronic documents, and of supporting infrastructure to store and retrieve documents in isiZulu, is an additional threat to its survival as a written language. This paper documents an investigation into the creation of one such infrastructural element - a custom Web search engine - for isiZulu, for which no such system previously existed. The search engine focused on two language-specific elements: morphological parsing and statistical language modelling. Morphological parsing was shown to produce better results for isiZulu, an agglutinative language, than traditional affix-based stemming. Statistical language modelling successfully separated isiZulu documents from others, thus enabling the use of a language-based focused crawler.

EPrint Type: Conference Paper
Keywords: isiZulu, Web search, morphological analysis, language modelling, focused crawling
Subjects: H Information Systems: H.1 MODELS AND PRINCIPLES
          H Information Systems: H.3 INFORMATION STORAGE AND RETRIEVAL
ID Code: 1065
Deposited By: Suleman, Hussein
Deposited On: 26 January 2016