A Management System for Integrated Querying of Web Databases

Berman, S. and C. Rouse (2008) A Management System for Integrated Querying of Web Databases. In van Brakel, P., Eds. Proceedings 10th Annual Conference of WWW Applications, University of Cape Town.

The World Wide Web has revolutionised information accessibility, but the data residing in databases connected to the Web, the so-called Deep Web or Hidden Web, is about 500 times larger than the Surface Web. While these databases can be accessed individually via query forms on their Web pages, we need architectures and tools that make it possible to query all relevant databases on the World Wide Web. This paper proposes such an architecture, based on a superpeer topology, and a set of intelligent tools to facilitate more automated access to Web databases. While complete automation is unlikely in the near future, a suitable framework and toolkit are necessary to maximise automation and reusability, and to minimise the effort required of the human user. Our system comprises a peer-to-peer framework and tools that exploit this topology in analysing, configuring and tuning components for a particular application domain, such as travel or real estate. We describe a prototype implementation and evaluate its usability, performance and accuracy in an initial experiment involving 32 independently constructed databases. Components for automated query processing, schema matching, schema translation and data transformation are presented. The prototype system uses a mediated schema approach to integrated querying of multiple databases, and includes a query interface which allows results from different databases to be viewed together or in separate tabs as they are received from each data source. Our findings highlight the benefits of a semi-automated approach to integrated querying of the Deep Web.
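As a rough illustration of the mediated schema idea mentioned above, the following minimal Python sketch shows how a single query posed against a shared schema might be translated into each source's own field names, and how result rows might be renamed back. It is not the paper's implementation: the class `SourceMapping`, the source names `site_a` and `site_b`, and all attribute names are hypothetical, chosen only to suggest a real estate domain.

```python
from dataclasses import dataclass


@dataclass
class SourceMapping:
    """Maps mediated-schema attributes to one source's local field names.

    Hypothetical structure for illustration only.
    """
    source: str
    attribute_map: dict  # mediated attribute -> local field name

    def translate_query(self, query: dict) -> dict:
        # Rewrite a mediated query into the source's vocabulary,
        # dropping attributes this source has no mapping for.
        return {
            self.attribute_map[attr]: value
            for attr, value in query.items()
            if attr in self.attribute_map
        }

    def transform_row(self, row: dict) -> dict:
        # Rename local fields in a result row back to mediated attributes,
        # discarding fields that have no mediated counterpart.
        reverse = {local: med for med, local in self.attribute_map.items()}
        return {reverse[field]: value for field, value in row.items() if field in reverse}


# Two assumed sources with differently named query-form fields.
mappings = [
    SourceMapping("site_a", {"city": "town", "max_price": "price_to"}),
    SourceMapping("site_b", {"city": "location"}),
]

# One query expressed against the mediated schema.
mediated_query = {"city": "Cape Town", "max_price": 2000000}

# Per-source queries, keyed by source name.
translated = {m.source: m.translate_query(mediated_query) for m in mappings}
```

In this sketch a source that lacks a mapping for an attribute simply ignores that constraint; a real system would need to decide whether to filter such results afterwards or exclude the source.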

