NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages

Meyer, Francois and Song, Haiyue and Chakrabarty, Abhisek and Buys, Jan and Dabre, Raj and Tanaka, Hideki (2024) NGLUEni: Benchmarking and Adapting Pretrained Language Models for Nguni Languages, Proceedings of the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Turin, Italy.

Abstract

The Nguni languages have over 20 million home language speakers in South Africa. There has been considerable growth in the datasets for Nguni languages, but so far no analysis of the performance of NLP models for these languages has been reported across languages and tasks. In this paper we study pretrained language models for the four Nguni languages: isiXhosa, isiZulu, isiNdebele, and Siswati. We compile publicly available datasets for natural language understanding and generation, spanning 6 tasks and 11 datasets. This benchmark, which we call NGLUEni, is the first centralised evaluation suite for the Nguni languages, allowing us to systematically evaluate the Nguni-language capabilities of pretrained language models (PLMs). Besides evaluating existing PLMs, we develop new PLMs for the Nguni languages through multilingual adaptive finetuning. Our models, Nguni-XLMR and Nguni-ByT5, outperform their base models and large-scale adapted models, showing that performance gains are obtainable through limited language group-based adaptation. We also perform experiments on cross-lingual transfer and machine translation. Our models achieve notable cross-lingual transfer improvements in the lower-resourced Nguni languages (isiNdebele and Siswati). To facilitate future use of NGLUEni as a standardised evaluation suite for the Nguni languages, we create a web portal to access the collection of datasets and publicly release our models.

Item Type: Conference paper
Subjects: Computing methodologies > Artificial intelligence > Natural language processing
Date Deposited: 08 Aug 2024 08:49
Last Modified: 08 Aug 2024 08:49
URI: https://pubs.cs.uct.ac.za/id/eprint/1669
