UCT CS Research Document Archive

Building a Multilingual and Mixed Arabic-English Corpus

Mustafa Ali, Mohammed and Hussein Suleman (2011) Building a Multilingual and Mixed Arabic-English Corpus. In Proceedings Arabic Language Technology International Conference (ALTIC) 2011, Alexandria, Egypt.

Most currently available test collections, and almost all cross-language information retrieval (CLIR) collections, focus on general-domain news stories. In addition, most of these corpora are built to support retrieval of documents using monolingual queries, even when the queries are translated. This paper presents the first phase, building the corpus, of ongoing research into multilinguality, with a special focus on mixed Arabic/English text in both queries and documents in scientific domains. Such a corpus is needed to develop better Web search algorithms for scholars in the Arab world. The paper also describes the features of the corpus, how it was collected, and how it was validated in terms of term frequencies, sparseness and vocabulary growth, using statistical tests. Results show that the data is currently imbalanced.
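
The kind of validation the abstract mentions can be illustrated with a minimal Python sketch. It is not the authors' procedure, which this record does not describe. It assumes whitespace-and-punctuation tokenization and checks two standard regularities: Zipf's law (term frequency falls roughly as a power of rank) and Heaps' law (vocabulary size grows roughly as a power of the number of tokens). All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: simple word tokenization. In Python 3, \w is Unicode-aware,
    # so Arabic and Latin-script words are both matched.
    return re.findall(r"\w+", text.lower())


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent term first (Zipf view)."""
    counts = Counter(tokens)
    return [(rank, freq)
            for rank, (_, freq) in enumerate(counts.most_common(), start=1)]


def vocabulary_growth(tokens):
    """Return (tokens_seen, distinct_terms_seen) after each token (Heaps view)."""
    seen = set()
    growth = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    return growth


def loglog_slope(points):
    """Least-squares slope of log(y) against log(x).

    Needs at least two points with distinct x values.
    """
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

On a real corpus, `loglog_slope(rank_frequency(tokens))` would be expected to be negative, and `loglog_slope(vocabulary_growth(tokens))` to lie between 0 and 1. Large deviations may point to sparseness or imbalance in the data.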

EPrint Type: Conference Paper
Keywords: Multilingual Query, Corpus, Zipf's Law, Heaps' Law
Subjects: H Information Systems: H.1 MODELS AND PRINCIPLES
ID Code: 747
Deposited By: Suleman, Hussein
Deposited On: 12 December 2011