Focused Crawling for Automated IsiXhosa Corpus Building

Marquard, Cael and Suleman, Hussein (2023) Focused Crawling for Automated IsiXhosa Corpus Building, Proceedings of SAICSIT 2023: South African Institute of Computer Scientists and Information Technologists, 2023, Muldersdrift, South Africa, Communications in Computer and Information Science, 1878, 19-31, Springer Nature Switzerland.

Focused Crawling for Automated IsiXhosa Corpus Building.pdf - Submitted Version

Abstract

IsiXhosa is a low-resource language: few large, high-quality corpora exist for it, which makes many kinds of research on the language difficult. This paper examines the use of focused Web crawling for automatic corpus generation. The resulting corpus is characterised using statistical methods: its vocabulary growth was found to fit Heaps’ Law, and its word frequency distribution was found to be heavy-tailed. In addition, as expected, the corpus statistics did not match those typical of non-agglutinative languages.
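
For readers unfamiliar with Heaps' Law, it models vocabulary size V as a power of the number of tokens n read so far, V(n) ≈ K·n^β. The abstract does not say how the fit was carried out, so the following is only a minimal sketch of one common approach: ordinary least squares on the log-log vocabulary growth curve. The function names `vocabulary_growth` and `fit_heaps` are hypothetical and are not taken from the paper.

```python
import math


def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: tokens read so far vs. distinct tokens seen so far."""
    seen = set()
    points = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        points.append((n, len(seen)))
    return points


def fit_heaps(points):
    """Fit V = K * n**beta by least squares on log V = log K + beta * log n.

    Illustrative assumption: a plain log-log regression, which may differ
    from the fitting procedure used in the paper.
    Requires at least two points with distinct n.
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                      # slope of the log-log line
    k = math.exp(mean_y - beta * mean_x)  # intercept, mapped back from log space
    return k, beta
```

Given a tokenised corpus as a list of strings, `fit_heaps(vocabulary_growth(tokens))` returns the estimated (K, β) pair. Morphologically rich text tends to keep introducing new word forms for longer, so β could then be compared against values reported for other languages.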

Item Type: Conference paper
Uncontrolled Keywords: Corpus, IsiXhosa, Web Crawling, Low Resource Languages
Subjects: General and reference > Document types > General conference proceedings
Information systems > World Wide Web > Web searching and information discovery > Web search engines > Web crawling
Date Deposited: 03 Aug 2023 06:14
Last Modified: 03 Aug 2023 06:14
URI: https://pubs.cs.uct.ac.za/id/eprint/1551
