Building a Multilingual and Mixed Arabic-English Corpus
Mustafa Ali, Mohammed and Hussein Suleman (2011) Building a Multilingual and Mixed Arabic-English Corpus. In Proceedings Arabic Language Technology International Conference (ALTIC) 2011, Alexandria, Egypt.
Most currently available test collections, and almost all CLIR collections, focus on general-domain news stories. In addition, most of these corpora are built to support retrieval of documents using monolingual queries, even if the queries are translated. This paper presents the first phase, building the corpus, of ongoing research into trends in multilinguality, with a special focus on mixed Arabic/English text in both queries and documents in scientific domains. Such a corpus is needed to develop good algorithms for Web searching by scholars in the Arab world. The paper also presents the features of the corpus, how it was collected, and how it was validated, using statistical tests, in terms of term frequencies, sparseness and vocabulary growth. Results show that the data is imbalanced at present.
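The abstract names term frequencies and vocabulary growth as validation criteria, and the keywords point to Zipf's law (frequency falls off roughly as a power of rank) and Heaps' law (vocabulary size grows roughly as a power of the number of tokens). The following is a minimal sketch of how such checks could be computed, not the authors' procedure: the function names, the plain least-squares fit on log-log values, and the assumption that the corpus is already tokenized into a list of strings are all illustrative.

```python
import math
from collections import Counter


def _slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    if len(xs) < 2:
        raise ValueError("need at least two points to fit a slope")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def zipf_slope(tokens):
    """Slope of log(frequency) against log(rank); near -1 for Zipf-like text."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    return _slope(xs, ys)


def vocabulary_growth(tokens):
    """Return (tokens seen, distinct terms seen) after each token."""
    seen = set()
    points = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        points.append((n, len(seen)))
    return points


def heaps_exponent(tokens):
    """Slope of log(vocabulary size) against log(tokens seen)."""
    points = vocabulary_growth(tokens)
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    return _slope(xs, ys)
```

Tokens are treated as opaque strings, so Arabic and English terms can be mixed in the same list. Fitting on every point, as done here, gives heavy weight to the long tail of rare terms; that is a simplification of this sketch.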
|EPrint Type:||Conference Paper|
|Keywords:||Multilingual Query, Corpus, Zipf's Law, Heaps' Law|
|Subjects:||H Information Systems: H.1 MODELS AND PRINCIPLES; H Information Systems: H.3 INFORMATION STORAGE AND RETRIEVAL|
|Deposited By:||Suleman, Hussein|
|Deposited On:||12 December 2011|