Data Augmentation for Low Resource Neural Machine Translation for Sotho-Tswana Languages

Mojapelo, Maxwell and Buys, Jan (2023) Data Augmentation for Low Resource Neural Machine Translation for Sotho-Tswana Languages, Proceedings of Southern African Conference for AI Research (SACAIR 2023), December 2023, Johannesburg, South Africa.

Data_Augmentation_for_Low_Resource_Neural_Machine_Translation_for_Sotho_Tswana_Languages-1.pdf - Accepted Version

Abstract

Neural Machine Translation (NMT) models have achieved remarkable performance when translating between high-resource languages. However, translation quality for languages with limited data is much worse. This research focuses on the low-resource language Sepedi and considers two data augmentation techniques to increase the size and diversity of English-Sepedi corpora for training an NMT model. First, we consider backtranslation, which makes use of the larger amount of available monolingual Sepedi text. We train a reverse (Sepedi to English) model and generate synthetic English sentences from the monolingual Sepedi sentences. These synthetic translation examples are added to the parallel English-Sepedi sentences. We carry out various experiments to investigate translation quality improvements. The second technique we consider is to generate synthetic data from parallel sentences between English and a closely related language, Setswana. Setswana words are replaced with Sepedi words through an induced bilingual dictionary, which is created by using a supervised Generative Adversarial Network to align the embeddings of Sepedi and Setswana words. We evaluate our models on the JW300, FLoRes and Autshumato test sets, finding improvements over the current benchmark BLEU scores across all three datasets.
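
The second technique, dictionary-based word replacement, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenization, and the placeholder tokens (tn_*, nso_*) are assumptions made for illustration, and the dictionary here is hand-written rather than induced from aligned embeddings.

```python
# Minimal sketch (assumed, not the paper's code): turn English-Setswana
# pairs into synthetic English-Sepedi pairs by word-level replacement.

def replace_with_dictionary(sentence, dictionary):
    """Replace each whitespace token that has a dictionary entry.

    Tokens without an entry are left unchanged.
    """
    return " ".join(dictionary.get(tok, tok) for tok in sentence.split())


def synthesize_pairs(en_tn_pairs, dictionary):
    """Keep the English side and rewrite the Setswana side token by token."""
    return [(en, replace_with_dictionary(tn, dictionary)) for en, tn in en_tn_pairs]


# Placeholder tokens stand in for real Setswana and Sepedi words.
toy_dictionary = {"tn_a": "nso_a", "tn_c": "nso_c"}
toy_pairs = [("english sentence one", "tn_a tn_b tn_c")]

synthetic = synthesize_pairs(toy_pairs, toy_dictionary)
```

In this sketch the resulting synthetic pairs would simply be appended to the genuine English-Sepedi parallel data before training, in the same way the abstract describes for backtranslated examples.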

Item Type: Conference paper
Subjects: Computing methodologies > Artificial intelligence > Natural language processing
Computing methodologies > Artificial intelligence > Natural language processing > Machine translation
Date Deposited: 10 Nov 2023 14:50
Last Modified: 10 Nov 2023 14:50
URI: https://pubs.cs.uct.ac.za/id/eprint/1637
