Marquard, Cael and Suleman, Hussein (2023) Focused Crawling for Automated IsiXhosa Corpus Building, Proceedings of SAICSIT 2023: South African Institute of Computer Scientists and Information Technologists, 2023, Muldersdrift, South Africa, Communications in Computer and Information Science, 1878, 19-31, Springer Nature Switzerland.
Text (Preprint of the paper "Focused Crawling for Automated IsiXhosa Corpus Building")
Focused Crawling for Automated IsiXhosa Corpus Building.pdf - Submitted Version
Abstract
IsiXhosa is a low-resource language, which means that it does not have many large, high-quality corpora. This makes it difficult to perform many kinds of research with the language. This paper examines the use of focused Web crawling for automatic corpus generation. The resulting corpus is characterised using statistical methods: its vocabulary growth has been found to fit Heaps’ Law, and its word frequency has been found to be heavy-tailed. In addition, as anticipated, the corpus statistics did not match those expected for non-agglutinative languages.
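As an illustration of the Heaps’ Law characterisation mentioned in the abstract, the following is a minimal sketch of how vocabulary growth could be measured and fitted to V(n) = K · n^β by least squares in log-log space. It is not the authors' method or code: the function names (`tokenize`, `vocabulary_growth`, `fit_heaps`) and the simple regular-expression tokenizer are assumptions made for this example only.

```python
import math
import re


def tokenize(text):
    """Hypothetical tokenizer: lowercase the text and split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def vocabulary_growth(tokens):
    """Return (n, V(n)) pairs: tokens read so far vs. distinct tokens seen so far."""
    seen = set()
    points = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        points.append((n, len(seen)))
    return points


def fit_heaps(points):
    """Fit log V = log K + beta * log n by ordinary least squares.

    Needs at least two points with different n. Returns (K, beta).
    """
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    k = math.exp(mean_y - beta * mean_x)
    return k, beta
```

A typical use would be `k, beta = fit_heaps(vocabulary_growth(tokenize(corpus_text)))`, where `corpus_text` is a placeholder for the crawled text. Because an agglutinative language produces many distinct surface forms, whitespace-level tokenization of this kind would likely yield faster vocabulary growth than it does for non-agglutinative languages.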
Item Type: | Conference paper |
---|---|
Uncontrolled Keywords: | Corpus, IsiXhosa, Web Crawling, Low Resource Languages |
Subjects: | General and reference > Document types > General conference proceedings; Information systems > World Wide Web > Web searching and information discovery > Web search engines > Web crawling |
Date Deposited: | 03 Aug 2023 06:14 |
Last Modified: | 03 Aug 2023 06:14 |
URI: | https://pubs.cs.uct.ac.za/id/eprint/1551 |