SubMerge: Merging Equivalent Subword Tokenizations for Subword Regularized Models in Neural Machine Translation

Song, Haiyue and Meyer, Francois and Dabre, Raj and Tanaka, Hideki and Chu, Chenhui and Kurohashi, Sadao (2024) SubMerge: Merging Equivalent Subword Tokenizations for Subword Regularized Models in Neural Machine Translation. In: Proceedings of the Annual Conference of the European Association for Machine Translation (EAMT 2024), Sheffield, United Kingdom.

Submerge.pdf - Published Version
Available under License Creative Commons Attribution Non-commercial No Derivatives.

Abstract

Subword regularized models leverage multiple subword tokenizations of one target sentence during training. Previous decoding algorithms select one tokenization during inference, leading to the underutilization of knowledge learned about multiple tokenizations. To address this, we propose the SubMerge algorithm to rescue the ignored Subword tokenizations through Merging equivalent ones during inference. SubMerge is a nested search algorithm where the outer beam search treats words as the minimal units, and the inner beam search provides a list of word candidates and their probabilities by merging subword tokenizations that form the same word. Experimental results on six machine translation datasets show more accurate word probability estimation and higher translation quality using SubMerge than beam search. Additionally, we provide time complexity analysis and investigate the effect of different beam sizes, training set sizes, dropout rates, and whether it is effective on non-regularized models.
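
The merging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `merge_equivalent_tokenizations`, the SentencePiece-style boundary marker `▁`, and the toy probabilities are all assumptions made for illustration, and the nested inner and outer beam searches are omitted. The sketch only shows how subword sequences that spell the same word could have their probabilities summed (in log space) to give a word-level probability.

```python
import math
from collections import defaultdict


def merge_equivalent_tokenizations(candidates, boundary="▁"):
    """Group subword tokenizations by the word they spell and sum their probabilities.

    candidates: list of (subword_tuple, log_prob) pairs (hypothetical input format).
    Returns a list of (word, merged_log_prob) sorted from most to least probable.
    """
    grouped = defaultdict(list)
    for subwords, log_prob in candidates:
        # Tokenizations are treated as equivalent if they concatenate to the same word.
        word = "".join(subwords).replace(boundary, "")
        grouped[word].append(log_prob)

    merged = {}
    for word, log_probs in grouped.items():
        # Numerically stable log-sum-exp over the equivalent tokenizations.
        top = max(log_probs)
        merged[word] = top + math.log(sum(math.exp(lp - top) for lp in log_probs))

    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# Toy probabilities, invented for illustration only.
candidates = [
    (("▁walk", "ing"), math.log(0.30)),
    (("▁wal", "king"), math.log(0.15)),
    (("▁w", "alking"), math.log(0.05)),
    (("▁running",), math.log(0.40)),
]

word_candidates = merge_equivalent_tokenizations(candidates)
for word, log_prob in word_candidates:
    print(f"{word}: {math.exp(log_prob):.2f}")
```

In this toy case the single most probable tokenization belongs to "running" (0.40), but the three tokenizations of "walking" together carry more probability mass (0.50), so the merged ranking prefers "walking". This is the kind of probability mass that selecting only one tokenization would leave unused.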

Item Type: Conference paper
Subjects: Computing methodologies > Artificial intelligence > Natural language processing
Computing methodologies > Artificial intelligence > Natural language processing > Machine translation
Computing methodologies > Artificial intelligence > Natural language processing > Natural language generation
Date Deposited: 08 Aug 2024 08:47
Last Modified: 08 Aug 2024 08:47
URI: https://pubs.cs.uct.ac.za/id/eprint/1666
