Building a Multilingual and Mixed Arabic-English Corpus

Mustafa Ali, Mohammed and Suleman, Hussein (2011) Building a Multilingual and Mixed Arabic-English Corpus, Proceedings of Arabic Language Technology International Conference (ALTIC) 2011, 9-10 October 2011, Alexandria, Egypt.


Abstract

Most currently available test collections, and almost all cross-language information retrieval (CLIR) collections, focus on general-domain news stories. In addition, most of these corpora are built to support retrieval of documents using monolingual queries, even if the queries are translated. This paper presents the first phase, building the corpus, of ongoing research into trends in multilinguality, with a special focus on Arabic/English multilingual text in both queries and documents in scientific domains. Such a corpus is needed to develop effective algorithms for Web search by scholars in the Arab world. The paper also describes the features of the corpus, how it was collected, and how it was validated in terms of term frequencies, sparseness and vocabulary growth, using statistical tests. Results show that the data is imbalanced at present.
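
The validation measures named in the abstract (term frequencies and vocabulary growth, which the keywords relate to Zipf's and Heaps' laws) can be illustrated with a minimal sketch. This is not the authors' code: the function names and the toy token stream below are hypothetical, and whitespace splitting stands in for whatever tokenization the paper actually used.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent term first.

    Plotting these on log-log axes is the usual way to inspect
    Zipf-like behaviour in a corpus.
    """
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def vocabulary_growth(tokens):
    """Return the number of distinct terms seen after each token.

    The resulting curve is what Heaps' law describes.
    """
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth


# Hypothetical toy stream of mixed Arabic-English tokens (assumption:
# simple whitespace tokenization, no normalization or stemming).
tokens = "the شبكة network the بيانات data the network".split()

print(rank_frequency(tokens))
print(vocabulary_growth(tokens))
```

On a real corpus the two lists would be fitted or plotted rather than printed; the sketch only shows which quantities are being counted.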

Item Type: Conference paper
Uncontrolled Keywords: Multilingual Query, Corpus, Zipf's Law, Heaps' Law
Subjects: Information systems
Information systems > Information retrieval
Date Deposited: 12 Dec 2011
Last Modified: 10 Oct 2019 15:33
URI: http://pubs.cs.uct.ac.za/id/eprint/747
