Infiniti: Deep Web Entity Retrieval

Search technology has become a key component of our daily lives, and the ability to locate relevant documents is now a standard commodity. This project takes search beyond merely locating documents. Entities and events are seen as the most salient ingredients of textual and audiovisual content: they represent valuable knowledge that is heavily searched for, and they conveniently serve as the link between heterogeneous content sets. Semantic search is structured around entities, their properties, relations, and developments, as well as pertinent attitudes and emotions. The project takes on the challenge of building semantic search technology, with a special focus on issue management algorithms and tools, i.e., tools for identifying entities and issues, naming stakeholders, capturing attitudes and emotions, and monitoring relevant events and changes in highly dynamic environments.

WP7 Deep Web Entity Retrieval

The objective of this work package is to open up deep web sources for entity retrieval by identifying the entity types a web source provides, by allowing natural (text search) access to structured data, and by combining entities from diverse deep web sources. WP7 aims to make it possible to share and search information without the need for the service provider (WCC) to crawl all data.

Recent research in entity retrieval has resulted in effective entity ranking when the task is well-defined, as in expert search, or when the data is well-organized, as in Wikipedia. Well-defined and well-organized data is increasingly available on the web, the prime example being the deep web. The deep web is a large part of the web that cannot be accessed by crawlers: mostly dynamic web pages that are returned in response to a web form. The objectives are met in four steps.

The first step, personal information sharing, evaluates a prototype content management and search system based on WCC's matching system ELISE at UT. The second, deep web entity probing, identifies what types of entity a deep web service provides by issuing probing queries. The challenge is to identify exactly which entity types a web service specializes in, although general types such as “persons” are of interest as well. The third, database natural abstraction layers, opens up a (deep) web service by returning dynamic pages in response to text queries or natural language questions. It combines closed-domain and open-domain question answering approaches to identify question patterns that a deep web source can answer and to translate those questions automatically into web form submissions.
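The entity probing step described above could be sketched roughly as follows. This is only an illustration, not the project's method: the probe queries, the 0.5 threshold, and the function names are all assumptions, and the `search` callable stands in for submitting a query to a real web form and counting the hits.

```python
from typing import Callable, Dict, List

# Hypothetical probe queries per general entity type (illustrative only).
PROBES: Dict[str, List[str]] = {
    "persons": ["john smith", "maria garcia", "wei chen"],
    "places": ["amsterdam", "lake geneva", "mount etna"],
    "products": ["usb cable", "office chair", "led lamp"],
}


def probe_entity_types(search: Callable[[str], int],
                       probes: Dict[str, List[str]] = PROBES,
                       threshold: float = 0.5) -> Dict[str, float]:
    """Return the entity types whose probe queries mostly yield results.

    `search` is an assumed stand-in for querying a deep web source; it
    takes a query string and returns the number of hits.
    """
    scores: Dict[str, float] = {}
    for entity_type, queries in probes.items():
        # Fraction of probe queries that returned at least one hit.
        hits = sum(1 for q in queries if search(q) > 0)
        rate = hits / len(queries)
        if rate >= threshold:
            scores[entity_type] = rate
    return scores


def fake_people_directory(query: str) -> int:
    """Toy source that mostly knows about people (assumed test data)."""
    known = {"john smith", "maria garcia", "amsterdam"}
    return 1 if query in known else 0


detected = probe_entity_types(fake_people_directory)
print(detected)
```

With the toy source, only "persons" passes the threshold, because two of its three probes return hits. A real prober would need many more probes per type and a way to tell apart the specific types a source specializes in from the general ones.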

The fourth step, entity search aggregation, will combine deep web information from several sources into a unified search result. This work will focus on the use of standards like OpenSearch and on extensions to support deep web entity retrieval. Where sources do not return standardized results, we investigate wrapper induction techniques for information extraction.
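As a rough illustration of aggregation, the sketch below merges result feeds from two sources into one list. It assumes each source returns an RSS 2.0 feed, one of the formats in which OpenSearch responses are commonly delivered. The round-robin merge, the function names, and the sample feeds are assumptions made for the example, not the approach chosen in the project.

```python
import xml.etree.ElementTree as ET
from itertools import zip_longest
from typing import List


def parse_rss_items(xml_text: str, source: str) -> List[dict]:
    """Extract title/link pairs from an RSS 2.0 result feed."""
    root = ET.fromstring(xml_text)
    return [
        {
            "source": source,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]


def interleave(result_lists: List[List[dict]]) -> List[dict]:
    """Round-robin merge that skips entries whose link was already seen."""
    merged: List[dict] = []
    seen = set()
    for row in zip_longest(*result_lists):
        for entry in row:
            if entry is not None and entry["link"] not in seen:
                seen.add(entry["link"])
                merged.append(entry)
    return merged


# Hypothetical sample feeds standing in for two deep web sources.
FEED_A = """<rss version="2.0"><channel>
<item><title>Ada Lovelace</title><link>http://a.example/ada</link></item>
<item><title>Alan Turing</title><link>http://a.example/alan</link></item>
</channel></rss>"""

FEED_B = """<rss version="2.0"><channel>
<item><title>Grace Hopper</title><link>http://b.example/grace</link></item>
</channel></rss>"""

results = interleave([
    parse_rss_items(FEED_A, "A"),
    parse_rss_items(FEED_B, "B"),
])
for r in results:
    print(r["source"], r["title"])
```

Round-robin interleaving is the simplest possible merge. It ignores relevance scores, which differ in scale between sources, so a realistic aggregator would need some form of score normalization or re-ranking.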