Language identification for South African Bantu languages Using Rank Order Statistics

Dube, Meluleki and Suleman, Hussein (2019) Language identification for South African Bantu languages Using Rank Order Statistics, Proceedings of the 21st International Conference on Asia-Pacific Digital Libraries (ICADL), 4-7 November 2019, Kuala Lumpur, Malaysia, Springer.


Abstract

Language identification is an important preprocessing step in many data management, information retrieval, and transformation systems. However, Bantu languages are known to be difficult to identify because of a lack of data and the similarity among the languages. This paper investigates the performance of n-gram counting using rank orders to discriminate among the Bantu languages spoken in South Africa, using varying test and training data sizes. The highest average accuracy obtained was 99.3%, with a test size of 495 characters and a training size of 600,000 characters. The lowest average accuracy obtained was 78.72%, with a test size of 15 characters and a training size of 200,000 characters.
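
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common way to do rank-order n-gram language identification: build a ranked profile of the most frequent character n-grams per language, then pick the language whose profile has the smallest "out-of-place" rank distance to the test text. The function names, the n-gram lengths (1 to 3), and the profile size (`top_k=300`) are assumptions chosen for illustration and are not taken from the paper.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank (0 = most frequent)."""
    # Normalise case and whitespace so counts are comparable across texts.
    cleaned = " ".join(text.lower().split())
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(cleaned) - n + 1):
            counts[cleaned[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty):
    """Sum of rank differences; n-grams missing from lang_profile cost max_penalty."""
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total


def identify(text, lang_profiles, n_max=3, top_k=300):
    """Return the label of the language profile closest to the text."""
    doc = ngram_profile(text, n_max, top_k)
    return min(
        lang_profiles,
        key=lambda lang: out_of_place(doc, lang_profiles[lang], top_k),
    )


if __name__ == "__main__":
    # Placeholder training strings, not real language data.
    training = {
        "lang_a": "abab abab abab",
        "lang_b": "xyxy xyxy xyxy",
    }
    profiles = {lang: ngram_profile(sample) for lang, sample in training.items()}
    print(identify("abab", profiles))
```

In a sketch like this, the training size corresponds to the length of the text used to build each language profile, and the test size to the length of the string passed to `identify`. Very short test strings yield few n-grams to rank, which is one plausible reason for accuracy to drop at 15 characters.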

Item Type: Conference paper
Subjects: Information systems
Date Deposited: 20 Sep 2019
Last Modified: 10 Oct 2019 15:31
URI: http://pubs.cs.uct.ac.za/id/eprint/1334
