UCT CS Research Document Archive

Assessing the Impact of Vocabulary Similarity on Multilingual Information Retrieval for Bantu Languages

Chavula, Catherine and Hussein Suleman (2016) Assessing the Impact of Vocabulary Similarity on Multilingual Information Retrieval for Bantu Languages. In Proceedings of the 8th Annual Meeting of the Forum on Information Retrieval Evaluation (FIRE 2016), pages 16-23, Kolkata, India.


Abstract

Despite the availability of massive amounts of open information and efforts to promote multilingualism on the Web, content in Bantu languages remains negligible. Additionally, Information Retrieval (IR) systems, such as the Google search engine, use algorithms that work well with languages that have the most content. Similarities across related languages, such as vocabulary overlap, can potentially be exploited to provide more opportunities for information access in languages with limited digital content. This study investigates how vocabulary similarity affects the quality of search results in Multilingual Information Retrieval (MLIR) environments. More specifically, the study evaluates indexing strategies for MLIR and their effect on the quality of retrieval for related languages. A multilingual test collection covering two Bantu languages, Citumbuka and Chichewa, as well as English was developed and used in the evaluation. The results show that when comparing related and unrelated language pairs, MLIR indexing strategies result in comparable or worse retrieval performance.

EPrint Type: Conference Paper
Subjects: H Information Systems: H.3 INFORMATION STORAGE AND RETRIEVAL
ID Code: 1161
Deposited By: Suleman, Hussein
Deposited On: 17 February 2017
Alternative Locations: https://doi.org/10.1145/3015157.3015160