Assessing the Impact of Vocabulary Similarity on Multilingual Information Retrieval for Bantu Languages

Chavula, Catherine and Suleman, Hussein (2016) Assessing the Impact of Vocabulary Similarity on Multilingual Information Retrieval for Bantu Languages, Proceedings of the 8th Annual Meeting of the Forum on Information Retrieval Evaluation (FIRE 2016), 8-10 December 2016, Kolkata, India, pp. 16-23, ACM.

Abstract

Despite the availability of massive open information and efforts to promote multilingualism on the Web, content in Bantu languages remains negligible. Additionally, Information Retrieval (IR) systems, such as the Google search engine, use algorithms that work well with languages that have the most content. Similarities across related languages, such as vocabulary overlap, can potentially be exploited to provide more opportunities for information access for languages with limited digital content. This study investigates how vocabulary similarity affects the quality of search results in Multilingual Information Retrieval (MLIR) environments. More specifically, the study evaluates indexing strategies for MLIR and their effect on the quality of retrieval for related languages. A multilingual test collection consisting of two Bantu languages, Citumbuka and Chichewa, and English was developed and used in the evaluation. The results show that when comparing related and unrelated language pairs, MLIR indexing strategies result in comparable or worse retrieval performance.

Item Type:	Conference paper
Subjects:	Information systems > Information retrieval
Alternate Locations:	https://doi.org/10.1145/3015157.3015160
Date Deposited:	17 Feb 2017
Last Modified:	10 Oct 2019 15:32
URI:	http://pubs.cs.uct.ac.za/id/eprint/1161
