AfriWeb: A Web Search Engine for a Marginalized Language

Malumba, Nkosana and Moukangwe, Katlego and Suleman, Hussein (2015) AfriWeb: A Web Search Engine for a Marginalized Language, Proceedings of the 17th International Conference on Asia-Pacific Digital Libraries, 9-12 December 2015, Seoul, South Korea, 180-189, Springer.

Abstract

isiZulu is a Bantu language spoken by approximately 9 million people, yet very few written isiZulu documents are available on the Internet. The lack of electronic documents, and of supporting infrastructure to store and retrieve documents in isiZulu, is an additional threat to its survival as a written language. This paper documents an investigation into the creation of one such infrastructural element - a custom Web search engine - for isiZulu, for which no such system previously existed. The search engine focused on two language-specific elements: morphological parsing and statistical language modelling. Morphological parsing was shown to produce better results for isiZulu, an agglutinative language, than traditional affix-based stemming. Statistical language modelling successfully separated isiZulu documents from others, thus enabling the use of a language-based focused crawler.
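
As a rough illustration of how a statistical language model can separate isiZulu pages from others for a focused crawler, the sketch below scores text with character-trigram models and picks the best-scoring language. It is not the paper's implementation: the abstract does not say which model was used, so the trigram order, the add-one smoothing, the toy training phrases, and all names (`CharNgramModel`, `classify`, `MODELS`) are assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of lower-cased, padded text."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, samples, n=3):
        self.n = n
        self.counts = Counter()
        for sample in samples:
            self.counts.update(char_ngrams(sample, n))
        self.total = sum(self.counts.values())
        # One extra vocabulary slot stands in for every unseen n-gram.
        self.vocab = len(self.counts) + 1

    def avg_log_prob(self, text):
        """Average smoothed log-probability per n-gram of the text."""
        grams = char_ngrams(text, self.n)
        log_prob = sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in grams
        )
        return log_prob / len(grams)


def classify(text, models):
    """Return the label of the model that scores the text highest."""
    return max(models, key=lambda label: models[label].avg_log_prob(text))


# Toy training data (assumption): a real system needs far larger corpora.
MODELS = {
    "zu": CharNgramModel([
        "Sawubona, unjani namhlanje?",
        "Ngiyabonga kakhulu, ngiyaphila.",
        "Umuntu ngumuntu ngabantu.",
        "Ngiyakuthanda kakhulu.",
    ]),
    "en": CharNgramModel([
        "Hello, how are you today?",
        "Thank you very much, I am fine.",
        "The search engine crawls the web.",
        "I love you very much.",
    ]),
}

if __name__ == "__main__":
    for page_text in ["Ngiyabonga kakhulu", "Thank you very much"]:
        print(page_text, "->", classify(page_text, MODELS))
```

A focused crawler built along these lines could call `classify` on the text of each fetched page and follow its outgoing links only when the result is `"zu"`.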

Item Type: Conference paper
Uncontrolled Keywords: isiZulu, Web search, morphological analysis, language modelling, focused crawling
Subjects: Information systems; Information systems > Information retrieval
Date Deposited: 26 Jan 2016
Last Modified: 10 Oct 2019 15:32
URI: http://pubs.cs.uct.ac.za/id/eprint/1065
