Measuring verb similarity using binary coefficients with application to isiXhosa and isiZulu

Mahlaza, Zola and Keet, C. Maria (2018) Measuring verb similarity using binary coefficients with application to isiXhosa and isiZulu. Proceedings of the Annual Conference of the South African Institute of Computer Scientists and Information Technologists (SAICSIT'18), September 26-28, 2018, Port Elizabeth, South Africa, e8, ACM.


Abstract

Natural Language Processing (NLP) for under-resourced languages may benefit from a bootstrapping approach that utilises the sparse resources available across closely related languages. This brings to the fore the question of language similarity, and with it the question of how to measure that similarity, so as to make informed predictions about the potential success of bootstrapping. We present a method for measuring morphosyntactic similarity by developing Context Free Grammars (CFGs) for isiXhosa and isiZulu verb fragments that are relevant to the use case of weather forecast generation. We then investigate the morphosyntactic similarity of the CFGs using parse tree analysis and four binary similarity measures. In particular, we selected four binary similarity measures from other domains and adapted them to our data, which are the word sets generated from the respective CFGs. The similarity measures, together with the parse tree analysis, are used to study the extent to which both languages can be generated by a single grammar fragment. The resulting 52 rules for isiXhosa and 49 rules for isiZulu overlap on 42 rules. This supports the existing intuition of similarity, as the two languages are in the same language cluster. The morphosyntactic similarity measured with the binary coefficients reached 59.5% overall (adapted Driver-Kroeber), with 99.5% for the past tense only. This score is lower than the rule overlap of the CFGs suggests, which is attributable to small differences in terminals, mainly in the prefix of the verb. The parse tree analysis and binary similarity measures show that a modularised set of rules for the prefix, verb root, and suffix would allow the generation of the two languages with a single grammar in which only the prefix requires differentiation.
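
To illustrate how a binary coefficient can be applied to two sets, the following Python sketch computes two coefficients from overlap counts. It is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the adapted formulas, so the Driver-Kroeber form below is the one commonly listed in surveys of binary similarity measures, and the Jaccard coefficient is included only as a familiar point of comparison (the abstract does not name the other three measures). The function names and the toy word sets (`w1` ... `w5`) are hypothetical placeholders rather than generated isiXhosa or isiZulu words. The rule counts are derived from the figures in the abstract (52 and 49 rules, 42 shared).

```python
def overlap_counts(items_x, items_y):
    """Return (a, b, c): shared items, items only in x, items only in y."""
    a = len(items_x & items_y)
    b = len(items_x - items_y)
    c = len(items_y - items_x)
    return a, b, c


def jaccard(a, b, c):
    """Jaccard coefficient: shared items over all distinct items."""
    total = a + b + c
    return a / total if total else 0.0


def driver_kroeber(a, b, c):
    """Driver-Kroeber coefficient in its commonly cited (unadapted) form.

    Assumption: the paper's adapted version may differ from this formula.
    """
    if a == 0:
        return 0.0
    return (a / 2) * (1 / (a + b) + 1 / (a + c))


# Hypothetical placeholder word sets, standing in for CFG-generated words.
words_x = {"w1", "w2", "w3", "w4"}
words_y = {"w3", "w4", "w5"}
toy_counts = overlap_counts(words_x, words_y)  # (2, 2, 1)

# Rule overlap taken from the abstract: 52 and 49 rules, 42 shared.
rule_counts = (42, 52 - 42, 49 - 42)

print(jaccard(*toy_counts), driver_kroeber(*toy_counts))
print(jaccard(*rule_counts), driver_kroeber(*rule_counts))
```

Applied to the rule counts, this sketch gives a Jaccard value of 42/59 (about 0.71) and a Driver-Kroeber value of about 0.83. These are illustrative arithmetic on the counts stated in the abstract, not results reported in the paper; the 59.5% and 99.5% figures in the abstract refer to the generated word sets, not to the rules.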

Item Type: Conference paper
Uncontrolled Keywords: isiZulu, isiXhosa, computational linguistics
Subjects: Computing methodologies
Alternate Locations: http://www.meteck.org/files/saicsit2018verbsimilarity.pdf
Date Deposited: 09 Nov 2018
Last Modified: 10 Oct 2019 15:31
URI: http://pubs.cs.uct.ac.za/id/eprint/1274
