Chavula, Catherine and Suleman, Hussein (2017) Morphological Cluster Induction of Bantu Words Using a Weighted Similarity Measure, Proceedings of Annual Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT 2017), 26-28 September 2017, Thaba 'Nchu, South Africa, ACM.
Abstract
Unsupervised morphological segmentation is attractive for low-density languages with little linguistic description, such as many Bantu languages. However, techniques that cluster morphologically related words use string similarity metrics that are better suited to languages with simple morphological systems. The paper proposes a weighted similarity measure whose Ordered Weighted Aggregator (OWA) operator weights are calculated from the normal distribution. The weighting favours shared character sequences that have a high likelihood of being part of stems in highly agglutinative languages. The approach is evaluated on Chichewa and Citumbuka text; both languages belong to group N of Guthrie's classification of Bantu languages. Cluster analysis results show that the proposed weighted word similarity metric produces better clusters than the Dice coefficient. Morpheme segmentation results on clusters generated using the OWA weights metric are comparable to those of state-of-the-art morphological analysis tools.
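The idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's measure: the abstract does not give the formula, so everything below is an assumption made for illustration. The function names (`normal_owa_weights`, `dice`, `weighted_dice`) are hypothetical, the weights follow a commonly used normal-distribution recipe for OWA weight vectors, and the weighted variant simply applies that vector to bigram positions so that shared bigrams near the middle of a word count for more than those at its edges. The example words are illustrative strings, not data from the paper.

```python
import math


def normal_owa_weights(n):
    """Weights for positions 1..n taken from a discretised normal curve.

    Assumed recipe: mean at the centre position, standard deviation equal
    to that of the positions themselves, normalised to sum to 1.
    """
    if n <= 0:
        return []
    if n == 1:
        return [1.0]
    mu = (n + 1) / 2
    sigma = math.sqrt(sum((i - mu) ** 2 for i in range(1, n + 1)) / n)
    raw = [math.exp(-((i - mu) ** 2) / (2 * sigma ** 2)) for i in range(1, n + 1)]
    total = sum(raw)
    return [r / total for r in raw]


def bigrams(word):
    """Character bigrams of a word, in order of occurrence."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def dice(a, b):
    """Plain Dice coefficient over the sets of character bigrams."""
    x, y = set(bigrams(a)), set(bigrams(b))
    if not x or not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def weighted_dice(a, b):
    """Hypothetical position-weighted analogue of Dice (illustration only).

    Each word's bigram positions receive normal-distribution weights that
    sum to 1; the score is the mean weight mass of the shared bigrams.
    """
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    shared = set(ba) & set(bb)
    wa, wb = normal_owa_weights(len(ba)), normal_owa_weights(len(bb))
    mass_a = sum(w for g, w in zip(ba, wa) if g in shared)
    mass_b = sum(w for g, w in zip(bb, wb) if g in shared)
    return (mass_a + mass_b) / 2


if __name__ == "__main__":
    # Illustrative word pair; compare the two scores side by side.
    print(dice("anapita", "amapita"), weighted_dice("anapita", "amapita"))
```

Both scores lie between 0 and 1 and equal 1 for identical words, so pairwise scores from either function could be fed to the same clustering step for comparison. Where the weight mass should sit (for example, towards the part of the word where stems tend to occur) is a design choice this sketch does not settle.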
Item Type: | Conference paper |
---|---|
Uncontrolled Keywords: | unsupervised morphological segmentation, similarity metrics |
Subjects: | Information systems > Information retrieval |
Alternate Locations: | https://dl.acm.org/citation.cfm?id=3129453 |
Date Deposited: | 25 Nov 2017 |
Last Modified: | 10 Oct 2019 15:31 |
URI: | http://pubs.cs.uct.ac.za/id/eprint/1225 |