Ranking by Language Similarity for Resource Scarce Southern Bantu Languages

Chavula, Catherine and Suleman, Hussein (2021) Ranking by Language Similarity for Resource Scarce Southern Bantu Languages, Proceedings of 2021 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '21), 11 July 2021, Virtual Event, ACM.

ictir077-chavulaA.pdf - Accepted Version


Resource Scarce Languages (RSLs) lack sufficient resources to use Cross-Lingual Information Retrieval (CLIR) techniques and tools such as machine translation. Consequently, searching in RSLs is frustrating and usually ends in an unsuccessful struggling search. In such search tasks, search engines return low-quality results: relevant documents are either few and ranked low, or non-existent. Previous work has shown that alternative relevant results written in similar languages, including dialects and neighbouring and genetically related languages, can help multilingual RSL speakers complete their search tasks. To improve the quality of search results in this context, we propose re-ranking documents based on the similarity between the language of the document and the language of the query. Accordingly, we created a dataset of four Southern Bantu languages that includes documents, topics, topical relevance and intelligibility features, and document utility annotations. To understand the intelligibility dimension of the studied languages, we conducted online intelligibility test experiments and used the data for feature selection and intelligibility prediction. We re-ranked search results with Learning To Rank (LTR) and assessed them through offline evaluation. Our results show that integrating topical relevance and intelligibility in ranking slightly improves retrieval effectiveness. Further, results on intelligibility prediction show that classifying intelligibility is feasible with fair accuracy.
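
The core idea of the abstract, re-ranking by combining topical relevance with the similarity between the document language and the query language, can be illustrated with a minimal sketch. This is not the authors' method: the paper explores Learning To Rank, whereas the sketch below substitutes a simple weighted sum. The names `Result`, `rerank`, and `weight`, the placeholder language codes, and all scores are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    doc_id: str
    language: str     # language the document is written in
    relevance: float  # topical relevance score, assumed to lie in [0, 1]


def rerank(results, query_language, similarity, weight=0.3):
    """Sort results by a weighted sum of topical relevance and language similarity.

    `similarity` maps (query_language, document_language) pairs to a score
    in [0, 1]; pairs that are missing are treated as 0. A document in the
    query language itself gets similarity 1. `weight` is the share given
    to language similarity (an assumed tuning parameter).
    """
    def score(result):
        if result.language == query_language:
            sim = 1.0
        else:
            sim = similarity.get((query_language, result.language), 0.0)
        return (1.0 - weight) * result.relevance + weight * sim

    return sorted(results, key=score, reverse=True)


# Hypothetical data: placeholder language codes and made-up scores.
results = [
    Result("d1", "lang_c", 0.80),
    Result("d2", "lang_b", 0.75),
    Result("d3", "lang_a", 0.60),
]
similarity = {("lang_a", "lang_b"): 0.8, ("lang_a", "lang_c"): 0.1}

reranked = rerank(results, "lang_a", similarity)
print([r.doc_id for r in reranked])
```

With these made-up values, the document in a closely related language (d2) and the document in the query language (d3) move ahead of the document that is topically strongest but written in a distant language (d1). Setting `weight=0` recovers the purely topical ordering.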

Item Type: Conference paper
Uncontrolled Keywords: Multilingual Information Retrieval, Retrieval Models and Ranking
Subjects: Information systems > Information retrieval > Specialized information retrieval > Structure and multilingual text search > Multilingual and cross-lingual retrieval
Alternate Locations: https://doi.org/10.1145/3471158.3472251
Date Deposited: 03 Dec 2021 11:38
Last Modified: 03 Dec 2021 11:38
URI: https://pubs.cs.uct.ac.za/id/eprint/1510
