Metadata Mapper Component

What follows is a brief summary of the Automatic and Manual Metadata Mapper component of this project. A full report is available for download.

Introduction

This tool assists users when adding data to DSpace by attempting to automatically map the fields of a legacy system to the Dublin Core metadata fields used by DSpace. Machine learning was used to predict which Dublin Core metadata field a given entry should be classified as. Five machine learning algorithms were selected and compared to determine which performs best on this task. The tool also allows data to be added to an existing DSpace repository, and metadata mappings can be saved for future use. It is accessed through a Web-based user interface that also allows the user to review and correct the attempted automatic metadata mappings.

Experiment Design and Execution

Overview

To compare the performance of the chosen machine learning algorithms, Weka (a Java-based machine learning framework) was used to train and test them. Training data was gathered from a variety of open access repositories, and a simple Java application was written to extract numerical features from this training data and generate the input file used by Weka for training and testing. Both cross-validation and an unseen data set were used to compare the performance of the algorithms.
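
As a rough illustration of this step, the sketch below extracts a few hypothetical numerical features (entry length, digit ratio, word count; the report's actual feature set may differ) from each metadata entry and writes them out in Weka's ARFF format. The class name and the Dublin Core field subset shown are assumptions made for the example.

    import java.io.PrintWriter;
    import java.util.List;

    /** Minimal sketch of feature extraction and ARFF generation for Weka. */
    public class ArffGenerator {

        // Hypothetical numerical features; the report's actual feature set may differ.
        static double[] extractFeatures(String entry) {
            double length = entry.length();
            double digitRatio = entry.chars().filter(Character::isDigit).count()
                    / Math.max(length, 1.0);
            double wordCount = entry.trim().isEmpty() ? 0 : entry.trim().split("\\s+").length;
            return new double[] { length, digitRatio, wordCount };
        }

        // Writes one labelled training record per metadata entry in ARFF format.
        static void writeArff(List<String> entries, List<String> labels, String path)
                throws Exception {
            try (PrintWriter out = new PrintWriter(path)) {
                out.println("@relation metadata");
                out.println("@attribute length numeric");
                out.println("@attribute digitRatio numeric");
                out.println("@attribute wordCount numeric");
                // Only a subset of the Dublin Core fields is shown here.
                out.println("@attribute class {title,creator,date,identifier,subject}");
                out.println("@data");
                for (int i = 0; i < entries.size(); i++) {
                    double[] f = extractFeatures(entries.get(i));
                    out.printf("%f,%f,%f,%s%n", f[0], f[1], f[2], labels.get(i));
                }
            }
        }
    }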

Training the machine learning algorithms

Training data was collected from 10 different open access repositories and consisted of 32,813 records. Care was taken to ensure that the data was representative and of high quality.

Selected algorithms

Five machine learning algorithms were selected for evaluation based on popularity and past performance on similar tasks; the default Weka parameters were used for training and testing (a sketch of instantiating them with these defaults follows the list). The following five algorithms were chosen:

- Random Forest
- J48 (C4.5 decision tree)
- Logistic Regression
- Artificial Neural Networks
- Naive Bayes
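
Assuming the standard Weka 3 class names, the five classifiers can be instantiated with their default parameters as sketched below; MultilayerPerceptron is Weka's usual artificial-neural-network implementation.

    import weka.classifiers.Classifier;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.Logistic;
    import weka.classifiers.functions.MultilayerPerceptron;
    import weka.classifiers.trees.J48;
    import weka.classifiers.trees.RandomForest;

    public class SelectedAlgorithms {
        // The five evaluated classifiers, all with Weka's default parameters.
        static Classifier[] classifiers() {
            return new Classifier[] {
                new RandomForest(),          // Random Forest
                new J48(),                   // C4.5 decision tree
                new Logistic(),              // logistic regression
                new MultilayerPerceptron(),  // artificial neural network
                new NaiveBayes()             // Naive Bayes
            };
        }
    }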

Performance of Machine Learning Algorithms

Both cross-validation and unseen data were used to evaluate the performance of the selected algorithms. A 10-fold, 10-iteration cross-validation was performed. The unseen data consisted of 109,993 records from an NRF database containing research-related data.
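
A 10-fold, 10-iteration cross-validation of this kind might be run in Weka as sketched below; the training file path and the choice of Random Forest are illustrative.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CrossValidate {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("training.arff");  // illustrative path
            data.setClassIndex(data.numAttributes() - 1);       // class is the last attribute

            double sum = 0;
            for (int seed = 1; seed <= 10; seed++) {            // 10 iterations
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(new RandomForest(), data, 10, new Random(seed)); // 10 folds
                sum += eval.pctCorrect();
            }
            System.out.printf("Mean accuracy over 10x10 cross-validation: %.2f%%%n", sum / 10);
        }
    }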

Cross-Validation results

Table 1 shows the cross-validation accuracy and standard deviation of the five selected algorithms. Random Forest performed statistically significantly better than all of the other algorithms.

Table 1: Cross-validation results

Algorithm | Percentage Correct (%) | Standard Deviation
Random Forest | 94.28 | 0.38
J48 (C4.5 decision tree) | 93.65 | 0.37
Logistic Regression | 79.04 | 0.62
Artificial Neural Networks | 76.59 | 0.91
Naive Bayes | 54.92 | 0.71

Unseen Data Classification Results

Table 2 shows the results when the trained algorithms were given unseen data (from the NRF). Once again, Random Forest and C4.5 were the two best performing algorithms, though here C4.5 performed markedly better than Random Forest.
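
For the unseen-data experiment, a model is built once on the training set and then scored on the separate NRF file, roughly as in the sketch below; both file paths are placeholders, and J48 stands in for any of the five classifiers.

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class UnseenEval {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("training.arff");   // illustrative paths
            Instances test  = DataSource.read("nrf_unseen.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            Classifier cls = new J48();
            cls.buildClassifier(train);       // train once on the full training set

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(cls, test);    // score on the unseen NRF records
            System.out.printf("Unseen accuracy: %.2f%%%n", eval.pctCorrect());
        }
    }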

Table 2: Unseen NRF data set results

Algorithm | Percentage Correct (%)
J48 (C4.5 decision tree) | 79.54
Random Forest | 68.08
Logistic Regression | 57.69
Artificial Neural Networks | 55.59
Naive Bayes | 33.91

UI Design

While this project took a fairly experimental approach to development and analysis, the system needed to provide a Web-based front-end UI through which organisations such as the NRF could access its features. As such, it was important to ensure that the system was developed to a quality suitable for a production release.

Figure 1 shows the screen that allows a user to upload a CSV input file. The uploaded file is then parsed and each entry is classified in turn. The user can also elect to save the resulting metadata mapping for future use, or to apply a previously saved mapping.
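
A plausible sketch of the per-entry classification step is shown below (CSV parsing itself is elided). It reuses the hypothetical extractFeatures from the earlier sketch and assumes a recent Weka version for DenseInstance and Utils.missingValue().

    import java.util.ArrayList;
    import java.util.List;
    import weka.classifiers.Classifier;
    import weka.core.DenseInstance;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.Utils;

    public class ColumnClassifier {
        // Predicts a Dublin Core field for every entry in one CSV column.
        // 'header' is an Instances object carrying the training-time attribute
        // definitions (with the class index set); 'model' is a trained classifier.
        static List<String> classifyColumn(List<String> entries, Classifier model,
                                           Instances header) throws Exception {
            List<String> predictions = new ArrayList<>();
            for (String entry : entries) {
                double[] f = ArffGenerator.extractFeatures(entry); // earlier sketch
                Instance inst = new DenseInstance(1.0,
                        new double[] { f[0], f[1], f[2], Utils.missingValue() });
                inst.setDataset(header);                           // attach attribute metadata
                int predicted = (int) model.classifyInstance(inst);
                predictions.add(header.classAttribute().value(predicted));
            }
            return predictions;
        }
    }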

Figure 1: UI of the submission page for a CSV file.

Figure 2 shows a page containing the results of an automatic metadata mapping. The algorithm was able to correctly classify all three fields, with the user only having to specify the secondary part of the Dublin Core 'date' field. In this example, 84% of the entries in the field 'Paper title' were classified as 'title'.
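
One plausible way to derive such a suggested mapping and its percentage is a simple majority vote over the per-entry predictions, as in this illustrative sketch (the class and method names are hypothetical):

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class MappingSuggester {
        // Tallies per-entry predictions for a column and reports the majority
        // field together with the share of entries that voted for it.
        static void suggestMapping(String columnName, List<String> predictions) {
            Map<String, Integer> counts = new HashMap<>();
            for (String p : predictions) counts.merge(p, 1, Integer::sum);

            String best = null;
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (best == null || e.getValue() > counts.get(best)) best = e.getKey();
            }
            double share = 100.0 * counts.get(best) / predictions.size();
            // e.g. "Paper title -> title (84% of entries)"
            System.out.printf("%s -> %s (%.0f%% of entries)%n", columnName, best, share);
        }
    }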

Figure 2: Results page that allows the user to review and correct a metadata mapping.

Software Usability and Acceptance

It was important that the tool being developed provided a usable and effective interface, and that it met the initial requirements of the NRF.

System Usability Testing

As the system required a front-end through which users interact with it, it was imperative that the UI be easy to use and intuitive. To test the tool's usability, a standard instrument was used: the System Usability Scale (SUS).

SUS was developed to represent the overall usability of a system as a single number from 0 to 100, with 100 being a 'perfect' score. Neither the raw data nor the mode responses indicated any particular usability issues. The overall SUS score achieved was a very acceptable 84.
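
For reference, a respondent's SUS score is computed from their ten 1-5 responses using the standard SUS scoring rule, as in this small sketch (the class and method names are illustrative):

    public class SusScore {
        // Computes one respondent's SUS score from their ten responses (each 1-5).
        static double score(int[] responses) {
            double sum = 0;
            for (int i = 0; i < 10; i++) {
                // Odd-numbered items (1, 3, 5, ...) contribute (response - 1);
                // even-numbered items (2, 4, 6, ...) contribute (5 - response).
                sum += (i % 2 == 0) ? responses[i] - 1 : 5 - responses[i];
            }
            return sum * 2.5;  // scales the 0-40 raw sum onto 0-100
        }
    }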

Acceptance of the Tool by the NRF

The NRF indicated that the metadata mapper met their requirements and that they were pleased with the results. They did, however, indicate that it would be useful if custom Dublin Core fields could be used, rather than being limited to the standard DSpace Dublin Core fields. The NRF usability survey results were positive, with the only concern being that respondents felt they had to learn a lot before using the system. The overall NRF SUS score achieved was 90.

Conclusions and Future Work

The metadata mapper proved to be a highly usable and effective tool for performing data migrations and imports into DSpace. The decision tree based algorithms were found to be very effective at classifying data into the correct Dublin Core metadata fields. Future work could provide users with the ability to map data to custom metadata fields. It would also be interesting to investigate the performance of other machine learning algorithms and feature sets.

Supplementary Material

The following supplementary materials are referenced in the Metadata Mapper report and are listed in the order in which they are referenced.

Section | File format
1.2 - Project Aims | PDF
1.2 - Project Aims | Excel - XLS
1.2 - Project Aims | Excel - XLS
2.1 - DSpace | PDF
5.1.2 - Initial Requirements Survey | Same as 1.2
5.3.2.1 - Initial Paper Prototype | PDF
5.3.2.1 - Initial Paper Prototype | PDF
5.3.3 - Testing, Documentation and Maintainability | ZIP archive
5.3.3 - Testing, Documentation and Maintainability | ZIP archive
6.1 - System Usability Testing | PDF
6.1 - System Usability Testing | Excel - XLSX
6.3 - Acceptance of the Tool by the NRF | PDF
6.3 - Acceptance of the Tool by the NRF | ZIP archive
6.3 - Acceptance of the Tool by the NRF | PDF
6.3 - Acceptance of the Tool by the NRF | PDF