UCT CS Research Document Archive

Morphological Cluster Induction of Bantu Words Using a Weighted Similarity Measure

Chavula, Catherine and Hussein Suleman (2017) Morphological Cluster Induction of Bantu Words Using a Weighted Similarity Measure. In Proceedings Annual Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT 2017), Thaba 'Nchu, South Africa.

Abstract

Unsupervised morphological segmentation is attractive for low-density languages with little linguistic description, such as many Bantu languages. However, techniques that cluster morphologically related words use string similarity metrics that are better suited to languages with simple morphological systems. The paper proposes a weighted similarity measure that derives Ordered Weighted Aggregator (OWA) operator weights from the normal distribution. For highly agglutinative languages, the weighting favours shared character sequences that have a high likelihood of being part of stems. The approach is evaluated on text for Chichewa and Citumbuka, which belong to group N of Guthrie's classification of Bantu languages. Cluster analysis results show that the proposed weighted word similarity metric produces better clusters than the Dice coefficient. Morpheme segmentation results on clusters generated using the OWA weights metric are comparable to those of state-of-the-art morphological analysis tools.
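
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the weights are applied to shared character sequences, so the sketch only shows (a) a Dice coefficient baseline over character bigrams and (b) one common way of deriving bell-shaped OWA weights from a normal distribution, where the mean and standard deviation are taken over the positions 1..n. All function names are hypothetical, and the choice of character bigrams for the Dice baseline is an assumption.

```python
import math
from collections import Counter


def char_bigrams(word):
    """Return the list of overlapping character bigrams in a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice_coefficient(a, b):
    """Dice coefficient over multisets of character bigrams (assumed baseline)."""
    ca, cb = Counter(char_bigrams(a)), Counter(char_bigrams(b))
    overlap = sum((ca & cb).values())          # shared bigrams, with multiplicity
    total = sum(ca.values()) + sum(cb.values())
    return 2 * overlap / total if total else 0.0


def normal_owa_weights(n):
    """OWA weights shaped like a normal density over positions 1..n.

    Assumption: mean and standard deviation are computed from the
    positions themselves, then the densities are normalised to sum to 1.
    """
    if n == 1:
        return [1.0]
    mu = (n + 1) / 2
    sigma = math.sqrt(sum((i - mu) ** 2 for i in range(1, n + 1)) / n)
    raw = [math.exp(-((i - mu) ** 2) / (2 * sigma ** 2)) for i in range(1, n + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def owa(values, weights):
    """Ordered weighted aggregation: weights apply to values sorted descending."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


if __name__ == "__main__":
    # Two illustrative word forms that share the substring "yenda".
    print(dice_coefficient("kuyenda", "amayenda"))
    print(normal_owa_weights(5))
```

With these weights the middle positions of the ordered list count most and the extremes count least; how the paper maps character sequences onto those positions is described in the full text, not in this abstract.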

EPrint Type: Conference Paper
Keywords: unsupervised morphological segmentation, similarity metrics
Subjects: H Information Systems: H.3 INFORMATION STORAGE AND RETRIEVAL
ID Code: 1225
Deposited By: Suleman, Hussein
Deposited On: 25 November 2017
Alternative Locations: https://dl.acm.org/citation.cfm?id=3129453