Low-Resource Language Modelling of South African Languages

Mesham, Stuart and Hayward, Luc and Shapiro, Jared and Buys, Jan (2021) Low-Resource Language Modelling of South African Languages, Proceedings of The Second Southern African Conference for AI Research, 6-10 December 2021, Dolphin Coast, KwaZulu-Natal.

Low_Resource_Language_Modelling_of_South_African_Languages__SACAIR_.pdf - Accepted Version


Language models are the foundation of current neural network-based models for natural language understanding and generation. However, research on the intrinsic performance of language models on African languages has been extremely limited, and is made more challenging by the lack of large or standardised training and evaluation sets that exist for English and other high-resource languages. In this paper, we evaluate the performance of open-vocabulary language models on low-resource South African languages, using byte-pair encoding to handle the rich morphology of these languages. We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs), and Transformers on small-scale datasets. Overall, well-regularized RNNs give the best performance across two isiZulu datasets and one Sepedi dataset. Multilingual training further improves performance on these datasets. We hope that this work will open new avenues for research into multilingual and low-resource language modelling for African languages.
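
As an illustration of the byte-pair encoding step mentioned in the abstract, the following is a minimal sketch of how BPE merges can be learned and then applied to split words into subword units. It is not the authors' implementation, and the paper's tokenizer settings are not given on this page. The function names (learn_bpe, merge_pair, segment), the "</w>" end-of-word marker, the number of merges, and the four-word isiZulu toy corpus are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge rules from a list of words (toy version)."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into subwords by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    # Hypothetical toy corpus; not drawn from the datasets used in the paper.
    corpus = ["ngiyabonga", "ngiyahamba", "siyabonga", "siyahamba"]
    merges = learn_bpe(corpus, num_merges=10)
    # An unseen word is still representable, because segmentation can always
    # fall back to single characters: this is what makes the vocabulary open.
    print(segment("ngiyafunda", merges))
```

Because any word can be decomposed down to individual characters, a model trained over such subword units never encounters an out-of-vocabulary token, which is useful for morphologically rich languages where many word forms are rare or unseen in a small corpus.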

Item Type: Conference paper
Subjects: Computing methodologies > Artificial intelligence > Natural language processing
Date Deposited: 03 Dec 2021 11:20
Last Modified: 03 Dec 2021 11:20
URI: https://pubs.cs.uct.ac.za/id/eprint/1493
