Mustafa Ali, Mohammed and Suleman, Hussein (2011) Building a Multilingual and Mixed Arabic-English Corpus, Proceedings of Arabic Language Technology International Conference (ALTIC) 2011, 9-10 October 2011, Alexandria, Egypt.
Abstract
Most currently available test collections, and almost all CLIR collections, focus on general-domain news stories. In addition, most of these corpora are built to support retrieval of documents based on monolingual queries, even if the queries are translated. This paper presents the first phase, building the corpus, of ongoing research to study trends in multilinguality, with a special focus on mixed Arabic/English text in both queries and documents in scientific domains. Such a corpus is needed to develop good algorithms for Web search by scholars in the Arab world. The paper also describes the features of the corpus, how it was collected, and how it was validated in terms of term frequencies, sparseness and vocabulary growth, using statistical tests. Results showed that the data is imbalanced at present.
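The validation measures named in the abstract (term frequencies, sparseness, vocabulary growth) can be illustrated with a minimal Python sketch. This is not the authors' code: the tokenizer, the function names, and the use of the hapax ratio as a sparseness indicator are all assumptions made for illustration, and the input is assumed to be undiacritized text.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: a simple regex tokenizer. In Python 3, \w matches Unicode
    # letters, so Arabic and English words are both captured. Arabic
    # diacritics are not matched by \w, so undiacritized input is assumed.
    return re.findall(r"\w+", text.lower())


def rank_frequency(tokens):
    # Term frequencies sorted from most to least frequent, paired with their
    # rank. Zipf's law predicts frequency roughly proportional to 1 / rank.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def vocabulary_growth(tokens):
    # (tokens read so far, distinct terms seen so far) after each token.
    # Heaps' law predicts vocabulary size growing roughly as K * n**beta.
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((n, len(seen)))
    return growth


def hapax_ratio(tokens):
    # Share of distinct terms that occur exactly once; used here as one
    # hypothetical indicator of sparseness.
    counts = Counter(tokens)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == 1) / len(counts)
```

Plotting the output of `rank_frequency` and `vocabulary_growth` on log-log axes is a common way to check how closely a corpus follows the two laws.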
Item Type: | Conference paper |
---|---|
Uncontrolled Keywords: | Multilingual Query, Corpus, Zipf's Law, Heaps' Law |
Subjects: | Information systems; Information systems > Information retrieval |
Date Deposited: | 12 Dec 2011 |
Last Modified: | 10 Oct 2019 15:33 |
URI: | http://pubs.cs.uct.ac.za/id/eprint/747 |