Morphological Cluster Induction of Bantu Words Using a Weighted Similarity Measure

Chavula, Catherine and Suleman, Hussein (2017) Morphological Cluster Induction of Bantu Words Using a Weighted Similarity Measure, Proceedings of the Annual Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT 2017), 26-28 September 2017, Thaba 'Nchu, South Africa, ACM.


Abstract

Unsupervised morphological segmentation is attractive for low-density languages with little linguistic description, such as many Bantu languages. However, techniques that cluster morphologically related words rely on string similarity metrics that are better suited to languages with simple morphological systems. This paper proposes a weighted similarity measure in which Ordered Weighted Aggregator (OWA) operator weights are calculated from a normal distribution. The weighting favours shared character sequences that are likely to be part of stems in highly agglutinative languages. The approach is evaluated on Chichewa and Citumbuka text; both languages belong to group N of Guthrie's classification of Bantu languages. Cluster analysis shows that the proposed weighted word similarity metric produces better clusters than the Dice coefficient. Morpheme segmentation results on clusters generated with the OWA-weighted metric are comparable to those of state-of-the-art morphological analysis tools.
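
The two ingredients named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the OWA weights are applied to shared character sequences, so the sketch shows only a baseline Dice coefficient over character bigrams and one commonly used way of deriving OWA weights from a normal distribution over positions 1..n. All function names are hypothetical, and the choice of bigrams and of this particular weighting formula are assumptions.

```python
import math


def char_bigrams(word):
    """Return the adjacent character pairs of a word (assumed unit: bigrams)."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice_coefficient(a, b):
    """Dice coefficient over the sets of character bigrams of two words."""
    x, y = set(char_bigrams(a)), set(char_bigrams(b))
    if not x and not y:
        return 0.0  # both words too short to form a bigram
    return 2 * len(x & y) / (len(x) + len(y))


def normal_owa_weights(n):
    """OWA weights from a discrete normal distribution over positions 1..n.

    Assumed formulation: the mean is the middle position, the standard
    deviation is that of the positions themselves, and the weights are
    normalised to sum to 1, so central positions receive the most weight.
    """
    mu = (n + 1) / 2
    sigma = math.sqrt(sum((i - mu) ** 2 for i in range(1, n + 1)) / n)
    if sigma == 0:  # only happens when n == 1
        return [1.0]
    raw = [math.exp(-((i - mu) ** 2) / (2 * sigma ** 2)) for i in range(1, n + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def owa(values, weights):
    """Apply an OWA operator: sort values in descending order, then weight them."""
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))
```

With unweighted Dice, every shared bigram counts equally, whether it falls in an affix or in a stem. A position-dependent weighting such as the one produced by `normal_owa_weights` is one way to make some shared sequences count more than others; how the paper maps those weights onto stem-like sequences is described in the full text, not in this abstract.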

Item Type:	Conference paper
Uncontrolled Keywords:	unsupervised morphological segmentation, similarity metrics
Subjects:	Information systems > Information retrieval
Alternate Locations:	https://dl.acm.org/citation.cfm?id=3129453
Date Deposited:	25 Nov 2017
Last Modified:	10 Oct 2019 15:31
URI:	http://pubs.cs.uct.ac.za/id/eprint/1225
