Malumba, Nkosana and Moukangwe, Katlego and Suleman, Hussein (2015) AfriWeb: A Web Search Engine for a Marginalized Language, Proceedings of 17th International Conference on Asia-Pacific Digital Libraries, 9-12 December 2015, Seoul, South Korea, 180-189, Springer.
Abstract
isiZulu is a Bantu language spoken by approximately 9 million people, but very few written documents in the language are available on the Internet. The lack of electronic documents, and of supporting infrastructure to store and retrieve documents in isiZulu, is an additional threat to its survival as a written language. This paper documents an investigation into the creation of one such infrastructural element - a custom Web search engine - for isiZulu, for which no such system previously existed. The search engine focused on two language-specific elements: morphological parsing and statistical language modelling. Morphological parsing was shown to produce better results for isiZulu, an agglutinative language, than traditional affix-based stemming. Statistical language modelling successfully separated isiZulu documents from others, thus enabling the use of a language-based focused crawler.
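The language-separation step mentioned in the abstract can be illustrated with a small sketch. This is not the model used in the paper, which the abstract does not specify; it is a toy character-trigram frequency model with add-one smoothing, written only to show how a focused crawler might decide whether a page is likely isiZulu. The names `char_trigrams`, `train`, `log_likelihood`, `classify` and `TOY_SAMPLES` are hypothetical, and the handful of training phrases are common everyday expressions chosen for illustration, not data from the study.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Return the overlapping character trigrams of text, padded with spaces."""
    padded = "  " + text.lower() + " "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples_by_language):
    """Count trigrams per language and compute one shared vocabulary size."""
    counts = {
        lang: Counter(g for s in samples for g in char_trigrams(s))
        for lang, samples in samples_by_language.items()
    }
    # Shared vocabulary (plus one slot for unseen trigrams) keeps the
    # smoothing comparable across languages.
    vocab_size = len(set().union(*counts.values())) + 1
    return counts, vocab_size


def log_likelihood(text, counter, vocab_size):
    """Add-one smoothed log-likelihood of text under one language's counts."""
    total = sum(counter.values())
    return sum(
        math.log((counter[g] + 1) / (total + vocab_size))
        for g in char_trigrams(text)
    )


def classify(text, counts, vocab_size):
    """Return the language whose trigram counts best explain text."""
    return max(counts, key=lambda lang: log_likelihood(text, counts[lang], vocab_size))


# Toy training data (an assumption for illustration only).
TOY_SAMPLES = {
    "isizulu": [
        "umuntu ngumuntu ngabantu",
        "ngiyabonga kakhulu",
        "sawubona unjani",
        "ngiyakuthanda",
    ],
    "english": [
        "the quick brown fox jumps over the lazy dog",
        "thank you very much",
        "hello how are you",
    ],
}

COUNTS, VOCAB_SIZE = train(TOY_SAMPLES)

if __name__ == "__main__":
    for phrase in ("ngiyabonga", "thank you"):
        print(phrase, "->", classify(phrase, COUNTS, VOCAB_SIZE))
```

In a focused crawler, a check of this kind would be applied to each fetched page, and only links from pages classified as isiZulu would be followed. A realistic system would need far more training text and a threshold for rejecting pages that match neither language well.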
Item Type: | Conference paper |
---|---|
Uncontrolled Keywords: | isiZulu, Web search, morphological analysis, language modelling, focused crawling |
Subjects: | Information systems; Information systems > Information retrieval |
Date Deposited: | 26 Jan 2016 |
Last Modified: | 10 Oct 2019 15:32 |
URI: | http://pubs.cs.uct.ac.za/id/eprint/1065 |