UCT CS Research Document Archive

Integrated Query of the Hidden Web

Berman, S, M Kamkuemah and J Muntunemuine (2009) Integrated Query of the Hidden Web. In Van Brakel, P.A., Eds. Proceedings 11th Annual Conference on World Wide Web Applications, pages 4-15, Port Elizabeth, South Africa.

Abstract

There is a need for software that can access multiple Websites through a single, common interface. This would allow users, for example, to compare flights for a particular trip across all relevant airline sites by posing a single query. This paper investigates automating this process in the case of airline databases hidden behind the Web (the so-called Deep Web or Hidden Web). We first constructed a prototype for integrated query of a handful of pre-determined airline sites. This proved useful in detecting commonalities and differences in the sites, and in selecting the most suitable technologies for working with multiple forms. A generic system was then designed, and components of the prototype were incrementally replaced by domain-specific tools able to handle arbitrary airline sites. Our results were promising for result interpretation, with 89% of response pages successfully handled. However, query formulation presented many problems, with only 39% of query interfaces automatically interpreted correctly, and even fewer amenable to automated query propagation. We conclude that integrated access to the Hidden Web is considerably more challenging than crawling the Surface Web, and that domain-specific systems are a promising approach to full automation.

EPrint Type: Conference Paper
Subjects: H Information Systems: H.3 INFORMATION STORAGE AND RETRIEVAL
ID Code: 583
Deposited By: Berman, Sonia
Deposited On: 06 December 2009