Language Identification and Focused Crawling - by Katlego Moukangwe

Motivation and Research Questions

Language identification is central to many pre-processing applications in information retrieval, and it is an essential pre-processing step for language processing techniques such as stemming, machine translation, and part-of-speech tagging. Is it possible to develop a language identification system capable of identifying IsiZulu text or strings? As mentioned in the problem statement, few good IsiZulu corpora are available on the Internet. The language identification system should perform accurately enough to be used in other research that requires language identification, such as part-of-speech tagging, machine translation, stemming, and focused crawling.

Is it possible to develop a focused crawler capable of identifying and harvesting IsiZulu documents on the Internet? Collecting IsiZulu documents is one step closer to building an IsiZulu search engine. An IsiZulu search engine would provide an easy way to locate IsiZulu literature, news, and other IsiZulu media that could otherwise be difficult to find with a general-purpose search engine. AfriWeb attempts to answer these two research questions.


Language Identification Implementation

Supervised classification is the task of choosing the correct label for a given input. Here, the language model must decide whether a given string is IsiZulu or non-IsiZulu. The model was trained using the Ukwabelana corpus. A feature extractor obtains the corpus's n-gram distribution, and a classifier decides whether the provided input belongs to the IsiZulu or the non-IsiZulu class. The final result expected from the language identification implementation is a classifier that assigns one of these two labels to each input string.
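As a minimal sketch of the feature-extraction step (the function names and the choice of character trigrams are assumptions for illustration, not the project's actual code), the following Python snippet computes an n-gram distribution over a small corpus:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def ngram_distribution(sentences, n=3):
    """Count character n-gram frequencies over a corpus of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(char_ngrams(sentence, n))
    return counts

# Toy example with two IsiZulu sentences.
print(ngram_distribution(["ngiyabonga kakhulu", "sawubona mngani"]).most_common(5))
```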

The language model was trained using the Ukwabelana sentence corpus. Words in the training data were broken up into n-grams, where an n-gram is an n-character slice of a string. The model was trained with the VariKN language modelling toolkit. Given the language model, a Bayesian classifier, written in Python, decides whether a string belongs to the IsiZulu or non-IsiZulu class.
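The decision rule can be sketched as follows. This is not the project's implementation: the real models were trained with VariKN, whereas the sketch below uses simple add-one-smoothed character trigram models, and all class and function names are assumed for illustration.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

class NgramLanguageModel:
    """Character n-gram model with add-one smoothing (a stand-in for a VariKN-trained model)."""

    def __init__(self, sentences, n=3):
        self.n = n
        self.counts = Counter()
        for sentence in sentences:
            self.counts.update(char_ngrams(sentence, n))
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts) + 1  # reserve probability mass for unseen n-grams

    def log_prob(self, text):
        """Sum of smoothed log-probabilities of the n-grams in `text`."""
        return sum(
            math.log((self.counts[gram] + 1) / (self.total + self.vocab))
            for gram in char_ngrams(text, self.n)
        )

def classify(text, zulu_model, other_model):
    """Label a string IsiZulu if the IsiZulu model assigns it the higher likelihood."""
    return "IsiZulu" if zulu_model.log_prob(text) > other_model.log_prob(text) else "non-IsiZulu"

# Toy usage with tiny training samples; the real models were trained on full corpora.
zulu = NgramLanguageModel(["ngiyabonga kakhulu", "sawubona mngani"])
other = NgramLanguageModel(["thank you very much", "hello my friend"])
print(classify("ngiyabonga", zulu, other))
```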

Focused Crawling Implementation

Crawling starts from a set of seed websites. The content of each page is loaded and then preprocessed so that the core text can be obtained; preprocessing involves the removal of punctuation, HTML tags, etc. The content is then forwarded to the classifier, which is built on the language model, to decide whether the page is relevant. If a page is relevant, all of its links are visited in search of further relevant pages. The pseudocode is shown in the figure below.


[Figure: pseudocode of the focused crawling loop]
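A minimal sketch of the crawling loop described above is given below, assuming the requests and BeautifulSoup libraries for fetching and HTML stripping. The is_relevant argument stands in for the language-model classifier, and the seed URLs, page limit, and error handling are illustrative assumptions rather than the project's actual crawler.

```python
import re
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def preprocess(html):
    """Strip HTML tags and punctuation, keeping only the page's text content."""
    text = BeautifulSoup(html, "html.parser").get_text(" ")
    return re.sub(r"[^\w\s]", " ", text)

def focused_crawl(seed_urls, is_relevant, max_pages=1000):
    """Breadth-first crawl that only follows links from pages judged relevant."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    harvested = []
    while queue and len(harvested) < max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        text = preprocess(html)
        if not is_relevant(text):          # classifier backed by the language model
            continue
        harvested.append(url)              # page is relevant: keep it ...
        for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            absolute = urljoin(url, link["href"])
            if absolute not in seen:       # ... and queue its unseen links
                seen.add(absolute)
                queue.append(absolute)
    return harvested
```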

Evaluation and Results

The language identification was tested with the four data sets described below.

IsiZulu: 29423 variable-length sentences, 100% Zulu words.
PEnglish: 28000 variable-length sentences, 100% English words.
PZplus: 25000 variable-length sentences, 40% Zulu and 60% English.
PItalian: 25215 variable-length sentences, 100% Italian words.
The fractions of true positives, true negatives, false positives, and false negatives were measured, and the results are shown below.

[Figure: fractions of true/false positives and negatives per data set]
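For reference, the reported fractions can be computed from raw outcome counts as in the short sketch below; the counts in the example are placeholders, not the project's measured values.

```python
def outcome_fractions(tp, tn, fp, fn):
    """Fractions of each outcome over all test sentences, plus overall accuracy."""
    total = tp + tn + fp + fn
    return {
        "true_positives": tp / total,
        "true_negatives": tn / total,
        "false_positives": fp / total,
        "false_negatives": fn / total,
        "accuracy": (tp + tn) / total,
    }

# Illustrative counts only; see the figure above for the actual results.
print(outcome_fractions(tp=950, tn=980, fp=20, fn=50))
```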

To evaluate the language model, 10 randomly chosen documents were provided to 10 human subjects, who were asked to state how many of the selected documents contained Zulu. The total number of documents classified as Zulu was noted: Set One contained 3879 documents, Set Two contained 5825 documents, and Set Three contained 64001 documents. The results are shown below.

[Figure: human evaluation results for the crawled document sets]

The language model developed achieved 98.4% accuracy. This answers the first research question: is it possible to develop a language identification system capable of identifying IsiZulu strings? The system correctly identified 98.4% of the IsiZulu documents provided, so it can be concluded that an accurate IsiZulu language identification system can be developed. As for the second question, is it possible to develop a focused crawler capable of finding and downloading IsiZulu documents on the Internet? More than 60,000 IsiZulu documents were successfully crawled, so it can be concluded that such a focused crawler is indeed feasible.