AfriWeb: A Web Search Engine for a Marginalized Language
Malumba, Nkosana, Katlego Moukangwe and Hussein Suleman (2015) AfriWeb: A Web Search Engine for a Marginalized Language. In Allen, Robert B, Jane Hunter and Marcia L. Zeng, Eds. Proceedings 17th International Conference on Asia-Pacific Digital Libraries, pages 180-189, Seoul, South Korea.
isiZulu is a Bantu language spoken by approximately 9 million people, yet very few written isiZulu documents are available on the Internet. The lack of electronic documents, and of supporting infrastructure to store and retrieve documents in isiZulu, is an additional threat to its survival as a written language. This paper documents an investigation into the creation of one such infrastructural element, a custom Web search engine for isiZulu, where no such system previously existed. The search engine focused on the language-specific elements of morphological parsing and statistical language modelling. Morphological parsing was shown to produce better results for isiZulu, an agglutinative language, than traditional affix-based stemming. Statistical language modelling successfully separated isiZulu documents from others, thus enabling the use of a language-based focused crawler.
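To make the idea of language-based filtering for a focused crawler concrete, the following is a minimal sketch. The abstract does not say which statistical model the paper used, so the smoothed character-trigram model here is an assumption chosen for illustration. The names `char_ngrams`, `CharNgramModel` and `looks_like_target` are hypothetical, and the training phrases are toy samples rather than the paper's corpus.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded string."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Bag-of-character-n-grams model with add-one smoothing (illustrative assumption)."""

    def __init__(self, samples, n=3):
        self.n = n
        self.counts = Counter()
        for sample in samples:
            self.counts.update(char_ngrams(sample, n))
        self.total = sum(self.counts.values())
        # One extra vocabulary slot stands in for all unseen n-grams.
        self.vocab = len(self.counts) + 1

    def avg_log_prob(self, text):
        """Average smoothed log-probability per n-gram, so text length does not dominate."""
        grams = char_ngrams(text, self.n)
        denom = self.total + self.vocab
        return sum(math.log((self.counts[g] + 1) / denom) for g in grams) / len(grams)


def looks_like_target(text, target_model, other_model):
    """True if the text scores higher under the target-language model."""
    return target_model.avg_log_prob(text) > other_model.avg_log_prob(text)


# Toy training data (hypothetical; a real system would need far larger corpora).
zulu_samples = [
    "sawubona unjani",
    "ngiyabonga kakhulu",
    "hamba kahle",
    "sala kahle",
    "umuntu ngumuntu ngabantu",
]
english_samples = [
    "hello how are you",
    "thank you very much",
    "go well",
    "stay well",
    "a person is a person through other people",
]

zulu_model = CharNgramModel(zulu_samples)
english_model = CharNgramModel(english_samples)

for page_text in ["siyabonga kakhulu", "thank you and stay well"]:
    keep = looks_like_target(page_text, zulu_model, english_model)
    print(page_text, "->", "crawl" if keep else "skip")
```

In a focused crawler, a check of this kind could decide whether a fetched page is kept and whether its outgoing links are followed. Where to set the decision threshold, and which languages to contrast against, are design choices this sketch does not settle.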
|EPrint Type:||Conference Paper|
|Keywords:||isiZulu, Web search, morphological analysis, language modelling, focused crawling|
|Subjects:||H Information Systems: H.1 MODELS AND PRINCIPLES; H Information Systems: H.3 INFORMATION STORAGE AND RETRIEVAL|
|Deposited By:||Suleman, Hussein|
|Deposited On:||26 January 2016|