Technical Report No. 136 - Abstract
W. May, G. Lausen
Information Extraction from the Web
The goal of information extraction from the Web is to provide an integrated view of data from autonomous, heterogeneous information sources. The main problem with current wrapper/mediator approaches is that they rely on very different formalisms and tools for wrappers and mediators, leading to an "impedance mismatch" between the wrapper and mediator levels. Additionally, most current approaches are restricted to accessing information from a fixed set of sources. Generic Web querying approaches, on the other hand, are restricted to purely syntactic and structural queries and do not address semantic issues.
In this paper, we discuss an integrated architecture for Web exploration, wrapping, mediation, and querying. Our system is based on a unified framework, i.e., a common data model and language, in which all tasks are performed. We regard the Web and its contents as a unit, represented in an object-oriented data model: the Web structure given by its hyperlinks, the parse trees of Web pages, and their contents are all included in the internal world model of the system. The advantage of this unified view is that the same data manipulation and querying language can be used for both the Web structure and the application-level model. The model is complemented by a rule-based object-oriented language extended with Web access capabilities and structured document analysis. Thus, accessing Web pages, wrapping, mediating, and querying information can all be done in the same language.
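The unified object model can be sketched as follows. This is a minimal, hypothetical Python illustration (the actual system uses a rule-based object-oriented language, not Python); the class and function names `Node`, `Page`, and `select` are assumptions for exposition only. The point is that hyperlink structure and parse-tree contents live in one object world, queried by one mechanism:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A node of a Web page's parse tree."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)

@dataclass
class Page:
    """A Web page as an object: its URL, hyperlinks, and parse tree."""
    url: str
    links: list   # outgoing hyperlinks (URLs), i.e., the Web structure
    root: Node    # parse tree of the page, i.e., the page contents

def select(node, tag):
    """Query the parse tree: yield all nodes carrying the given tag."""
    if node.tag == tag:
        yield node
    for child in node.children:
        yield from select(child, tag)

# One query mechanism over both levels: structure (links) and contents.
page = Page(
    url="http://example.org/",
    links=["http://example.org/a", "http://example.org/b"],
    root=Node("html", children=[Node("title", "Example"),
                                Node("p", "Some text")]),
)
titles = [n.text for n in select(page.root, "title")]
print(titles)      # contents, via the parse tree
print(page.links)  # structure, via the same object
```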
This integration also allows for data-driven Web exploration that is independent of a given network of individual, predefined wrappers and mediators. Thus, in addition to the classical wrapper and mediator functionality, a system with this architecture can be equipped with Web navigation and exploration functionality. Queries to existing Web indexing and search engines can also be integrated.
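Data-driven exploration of this kind can be sketched as a rule-guided traversal. The following is a toy Python sketch, not the system's actual mechanism: `fetch_links` and `follow` are hypothetical parameters standing in for Web access and for a user-supplied rule deciding which discovered links to pursue, so exploration is not bound to a fixed set of sources:

```python
from collections import deque

def explore(seeds, fetch_links, follow, limit=100):
    """Breadth-first, rule-guided exploration starting from seed URLs.
    `fetch_links(url)` returns a page's outgoing links; `follow(url)`
    is a rule deciding whether a discovered link should be visited."""
    seen, queue, visited = set(seeds), deque(seeds), []
    while queue and len(visited) < limit:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen and follow(link):
                seen.add(link)
                queue.append(link)
    return visited

# Toy in-memory "Web" standing in for real HTTP access.
web = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
order = explore(["a"], lambda u: web.get(u, []),
                follow=lambda u: u != "c")  # rule: never visit page "c"
print(order)  # -> ['a', 'b', 'd']
```

The rule `follow` is where application knowledge enters: the same traversal skeleton is reused while the rule decides which initially unknown parts of the Web get explored.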
In particular, we present a methodology for reusing generic rule patterns for typical extraction, integration, and restructuring tasks using this framework. In an abstract sense, the system contains a universal wrapper, which can be applied to arbitrary Web pages that the system learns about during information processing. Equipped with suitably intelligent rules, the system can potentially explore initially unknown parts of the Web, thus coping with the steady growth of the Web.
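A reusable rule pattern can be thought of as a generic extraction rule parameterized by the task at hand. The sketch below is a hypothetical Python analogue (the helper `make_rule` and the tuple-based tree encoding are assumptions, not the report's notation): one generic pattern, instantiated with a tag path, plays the role of a wrapper applicable to arbitrary parse trees:

```python
def make_rule(path):
    """Return an extraction rule for parse trees encoded as nested
    (tag, text, children) tuples: collect the text of every node
    reachable along the given path of tags."""
    def rule(node):
        tag, text, children = node
        if tag != path[0]:
            # pattern does not match here; search below with the full path
            return [t for c in children for t in rule(c)]
        if len(path) == 1:
            return [text]
        sub = make_rule(path[1:])
        return [t for c in children for t in sub(c)]
    return rule

# Instantiating the generic pattern yields a task-specific rule.
tree = ("html", "", [
    ("body", "", [("item", "first", []),
                  ("item", "second", []),
                  ("note", "aside", [])]),
])
extract_items = make_rule(["body", "item"])
print(extract_items(tree))  # -> ['first', 'second']
```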
We demonstrate the practicability of our approach using the FLORID system and illustrate it with two case studies.