Pre-processing and Indexing - by Nkosana Malumba

The relationship between a query and a document is determined by the frequency of the terms contained in the query that the document has. However, as documents have to adhere to language constraints, a single word may have morphological variants and the string matching algorithms that are used by search engine will not match the possible variants. Therefore, by reducing the terms to their root form, the results returned by the search engine increases and thus improving the probability that the results will be relevant to the user's query. The two algorithms that were used in this process are the stemming algorithm and the morphological parser.

There are two main principles that are used in the development of stemming algorithm, the iteration and longest-match principle. The iteration principle is based on the fact affixes are attached to stems in a certain order using a predefined class of affixes. The algorithm simply removes the affixes from either start to end or end to start; based on which class the detected affix matches. The second principle, the longest match; states that within any given class of endings, if more than a single ending provides a match, the longest one should be removed from the word. The affixes used in the development of the stemming algorithm included prefixes from the noun classification system and nominal suffixes.

The morphological parser was primarily based on the affixes and word formation rules. In terms of the prefixal analysis, the scope focused on the noun classification and concordial systems. The noun classification system forms the basis of all prefixes and determines the types of concords that can be applied to a stem. These concords have different categories, which have different semantics that must be considered when analyzing a particular word. The research questions that were being investigated by the development of the pre-processing algorithms were:

Implementation

The stemming algorithm was developed based on existing algorithms, which are Porter's and Spanish Stemming algorithm. The algorithm for the stemming process is as follows:

stemming

The morphological parser was developed using approximations that were derived from the concordial system and noun classification rules. Although some overlaps are present between concordial types, it was presumed that these would not significantly affect the parsing of a word. For example, the concord “ba” can be either a plural prefix, a subject concord or an object concord. The morphological analysis algorithm is as follows:

suffixA suffixB

Evaluation

Two experiements were designed to evaluate the research questions that were set in the project proposal. The purpose of the first experiement was to evaluate the accuracy of the morphological parser as compared to the stemming algorithm given a language corpus. The second experiement required the pre-processing algorithms to be used in the search engines as a pre-processing step during indexing and retreival of information. The experiement made use a of user population of people literate in the language. The purpose of the experiement was to measure if the relevance of the results given a user’s query would increase by using a morphological parser in comparison to a stemming algorithm when indexing and querying results.

Results

The morphological parser produced 48% accuracy in which the result contained the stem or root word as compared to the 42% produced by the stemming algorithm. The morphological parser resulted in a higher accuracy in the reduction of words due to its sensitivity to the word formation rules and morphemes used in the derivation of inflected forms. This allows the morphological parser to deconstruct words using a set of predefine set of morphemes.

The use of a morphological parser in the indexing and querying of data resulted in a higher precision score as opposed to using a stemming algorithm. The mean precision of the morphological parser was found to be 0.138 as compared to the stemming algorithm with a precision score of 0.102. One key factor that contributed to this is that the morphological parser was able to reduce words to their root forms using their morphology as a guide. In the case of the stemming, a brute force stripping of suffixes and prefixes may have resulted in a phenomenon called understemming or overstemming. These processes usually result in the word being incorrectly stemmed due to the some of the morphemes either being incorrectly detected, or being totally omitted by the algorithm. Additionally, a positive correlation was found between the length of the query and the relevance of the documents. The relationship is show in the graph below. Bild