Morphological Cluster Induction of Bantu Words Using a Weighted Similarity Measure
Chavula, Catherine and Hussein Suleman (2017) Morphological Cluster Induction of Bantu Words Using a Weighted Similarity Measure. In Proceedings of the Annual Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT 2017), Thaba 'Nchu, South Africa.
Unsupervised morphological segmentation is attractive for low-density languages with little linguistic description, such as many Bantu languages. However, techniques that cluster morphologically related words use string similarity metrics that are better suited to languages with simple morphological systems. The paper proposes a weighted similarity measure whose Ordered Weighted Aggregator (OWA) operator weights are calculated from a normal distribution. For highly agglutinative languages, the weighting favours shared character sequences that have a high likelihood of being part of stems. The approach is evaluated on text in Chichewa and Citumbuka, which belong to Group N of Guthrie's classification of Bantu languages. Cluster analysis results show that the proposed weighted word similarity metric produces better clusters than the Dice coefficient. Morpheme segmentation results on clusters generated using the OWA-weighted metric are comparable to those of state-of-the-art morphological analysis tools.
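The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the weights are combined with shared character sequences, so the code only shows (a) the baseline Dice coefficient, assumed here to be computed over character bigrams, and (b) one common way of deriving OWA weights from a normal distribution, assumed here to be the scheme in which the mean and standard deviation are taken over the positions 1..n. All function names are hypothetical.

```python
import math
from collections import Counter


def char_bigrams(word):
    """Return the list of overlapping character bigrams of a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice_coefficient(a, b):
    """Dice coefficient over character-bigram multisets (assumed unit of comparison)."""
    x, y = Counter(char_bigrams(a)), Counter(char_bigrams(b))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        return 0.0
    shared = sum((x & y).values())  # multiset intersection size
    return 2.0 * shared / total


def normal_owa_weights(n):
    """OWA weights for n positions from a normal density centred on the
    middle position (an assumed scheme, not confirmed by the abstract)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    mu = (1 + n) / 2
    variance = sum((i - mu) ** 2 for i in range(1, n + 1)) / n
    if variance == 0:  # n == 1: a single position takes all the weight
        return [1.0]
    raw = [math.exp(-((i - mu) ** 2) / (2 * variance)) for i in range(1, n + 1)]
    total = sum(raw)
    return [r / total for r in raw]  # normalise so the weights sum to 1


def owa(values, weights):
    """Ordered weighted aggregation: sort values in descending order,
    then take the weighted sum with the position weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


if __name__ == "__main__":
    print(dice_coefficient("night", "nacht"))
    print(normal_owa_weights(5))
```

Under these assumptions the weights are symmetric and peak at the middle position, so values in the middle of the ordering contribute most to the aggregate. How such weights are mapped onto character sequences within a word is specific to the paper and is not reproduced here.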
EPrint Type: Conference Paper
Keywords: unsupervised morphological segmentation, similarity metrics
Subjects: H Information Systems: H.3 INFORMATION STORAGE AND RETRIEVAL
Deposited By: Suleman, Hussein
Deposited On: 25 November 2017