SearchAndQueryInPDSMS (09 May 2008, Main.Admininistrator)

Search and Query in PDSMS

In personal dataspaces, users should be able to query any data item regardless of its format or data model. Queries can be posed in a variety of languages (and underlying data models) and should be reformulated into other data models and schemas as well as possible, leveraging exact and approximate semantic mappings. Ideally, a PDSMS should support the following kinds of search and querying.

  • Universal Search and Querying :

Keyword search is more forgiving than a structured query: it is based on similarity and returns ranked results to end users. It is especially suitable in a dataspace environment, as users typically do not know all the disparate underlying structures. Recently the database community has studied how to do keyword search on structured data such as relational data or XML data. However, providing unified search and seamless querying goes far beyond supporting keyword search on structured and unstructured data: the system needs to identify the data sources that are relevant to the query, it needs to provide a meaningful ranking for answers from different data sources, and it needs to be able to answer structured queries on unstructured data. Furthermore, structured querying is often too strict in requiring detailed knowledge of the underlying schemas, whereas keyword search is often inadequate for sophisticated users who wish to specify structural requirements. To improve users' querying experience, a PDSMS needs to support new kinds of querying paradigms that combine structured and unstructured querying in a fundamental way. For example, it should enable a user to specify a keyword query, retrieve data from all relevant data sources, and iteratively refine the query into a structured query when appropriate. Furthermore, for both types of queries, we emphasize returning possibly related data in answers to queries rather than only the data that strictly satisfy the query.

Towards providing a universal search and querying service, it is interesting to study the following immediate problems:

(a) query routing : given a keyword query, detect the user's intention and find the sources that are most relevant to the query;

(b) query reformulation : given a keyword query and a relevant structured source, reformulate the query according to the schema of the source;

(c) result ranking : given a set of answers obtained from both structured and unstructured sources, rank them according to multiple criteria such as the relevance of the answers, the level of detail of the information, and the authority of the sources;

(d) query refinement : given a keyword query or a reformulated structured query, help the user refine it to obtain better query results.
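Problems (a) to (c) can be illustrated with a minimal sketch. It is only one possible reading of the text above, under simplifying assumptions: each source exposes a lower-case column list and a small vocabulary of known terms, routing scores a source by the fraction of keywords it covers, reformulation turns keywords into a parameterized SQL string, and ranking is a weighted sum of relevance and authority. The `Source` class, the function names, and the weights are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical description of one structured data source."""
    name: str
    table: str             # assumed table name
    columns: list[str]     # schema attributes, assumed lower-case
    vocabulary: set[str]   # lower-case terms known to occur in the source
    authority: float       # assumed authority score in [0, 1]


def route(keywords: list[str], sources: list[Source]) -> list[tuple[Source, float]]:
    """(a) Query routing: score a source by the fraction of keywords it covers."""
    scored = []
    for src in sources:
        known = src.vocabulary | set(src.columns)
        hits = sum(1 for kw in keywords if kw.lower() in known)
        if hits:
            scored.append((src, hits / len(keywords)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def reformulate(keywords: list[str], src: Source) -> tuple[str, list[str]]:
    """(b) Query reformulation: keywords naming a column are projected,
    every other keyword becomes a LIKE filter over all columns."""
    projected = [kw.lower() for kw in keywords if kw.lower() in src.columns]
    terms = [kw for kw in keywords if kw.lower() not in src.columns]
    sql = f"SELECT {', '.join(projected) or '*'} FROM {src.table}"
    params: list[str] = []
    clauses = []
    for term in terms:
        clauses.append("(" + " OR ".join(f"{col} LIKE ?" for col in src.columns) + ")")
        params.extend([f"%{term}%"] * len(src.columns))
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


def rank(answers: list[tuple[str, float, float]],
         w_relevance: float = 0.7, w_authority: float = 0.3) -> list[str]:
    """(c) Result ranking: order (answer, relevance, authority) triples
    by a weighted sum; the weights are arbitrary illustration values."""
    ordered = sorted(answers,
                     key=lambda a: w_relevance * a[1] + w_authority * a[2],
                     reverse=True)
    return [answer for answer, _, _ in ordered]
```

Query refinement (d) would then operate on the output of `reformulate`, for example by letting the user edit the projected columns or tighten the filters. A real system would need far richer signals than keyword overlap to detect the user's intention.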

  • Best-effort Querying :

For a personal dataspace, precise schema mappings between data sources are hard to generate. Integration in a PDSMS is a schema-later, ongoing process that does not define a global schema. The process starts with disparate data sources and incrementally improves the semantic mappings on an as-needed basis. The data sources become increasingly integrated over time, thereby improving the ability to share data between them. At any point during the ongoing integration, queries should be answered as well as possible using the available and inferred semantic relationships.

Queries in a PDSMS may offer varying levels of service, and in some cases may return best-effort or approximate answers. For example, when individual data sources are unavailable, a PDSMS should still produce the best results it can, using the data accessible to it at the time of the query. Initially, a PDSMS should support keyword queries on any data source, similar to those provided by existing desktop search systems. As we gain more information about a data source or integrate those sources in an incremental, pay-as-you-go fashion, we should be able to gradually support more sophisticated queries. The system should support a graceful transition between keyword querying, browsing and structured querying. In particular, when answers are given to a keyword (or structured) query, additional query interfaces should be proposed that enable the user to refine the query.
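The best-effort behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not a description of an existing PDSMS: every source offers a keyword search callable, a source gains a structured search callable only once a mapping to it has been established, and an unavailable source signals this by raising `ConnectionError` or `TimeoutError`. All class and function names are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DataSource:
    """Hypothetical wrapper around one data source in the dataspace."""
    name: str
    keyword_search: Callable[[str], list[str]]
    # Set only after a semantic mapping to this source exists (assumption).
    structured_search: Optional[Callable[[dict], list[str]]] = None


def best_effort_query(keyword: str,
                      structured: Optional[dict],
                      sources: list[DataSource]):
    """Answer with whatever each source can currently support.

    Returns (answers, skipped): answers is a list of
    (source name, mode used, hit) triples and skipped lists the
    sources that were unavailable at query time.
    """
    answers, skipped = [], []
    for src in sources:
        try:
            if structured is not None and src.structured_search is not None:
                # Pay-as-you-go: use the richer interface where it exists.
                hits, mode = src.structured_search(structured), "structured"
            else:
                # Fall back to the baseline every source supports.
                hits, mode = src.keyword_search(keyword), "keyword"
        except (ConnectionError, TimeoutError):
            skipped.append(src.name)  # answer without this source
            continue
        answers.extend((src.name, mode, hit) for hit in hits)
    return answers, skipped
```

Returning the list of skipped sources alongside the answers lets a caller tell the user that the result is partial, which fits the idea that a query may offer varying levels of service.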

  • Meta-data Queries

A PDSMS should support a wide spectrum of meta-data queries. These include:

(a) Query by association : specifying which other data items in the dataspace may depend on or be associated with a particular data item, e.g., find the email that John sent me the day I came back from Hawaii, or retrieve the experiment files associated with my SIGMOD paper this year. It differs from traditional keyword search in that it also explores associations between data items, and so it leverages additional structure that may exist in the data or may have been automatically discovered. For example, searching for PIM returns not only the papers and presentations that mention the PIM project, but also people working on PIM and conferences in which PIM papers have been published. In addition, it can support hypothetical queries (e.g., What would change if I removed data item X?).

(b) Query about source : including the source of an answer or how it was derived or computed, and the degree of uncertainty about the answers, e.g., find all the papers where I acknowledged a particular grant, find all the experiments run by a particular student, or find all spreadsheets that have a variance column. A PDSMS should also support queries locating data, where the answers are data sources rather than specific data items. For example, the system should be able to answer a query such as: Where can I find data about IBM? or What sources have a salary attribute? Similarly, given an XML document, one should be able to query for XML documents with similar structures, and XML transformations that involve them. Finally, given a fragment of a schema or a web-service description, it should be possible to find similar ones in the personal dataspace.

(c) Query about timestamps and lineage : providing timestamps or lineage on the data items that participated in the computation of an answer. Timestamps are useful for finding resources that were touched at about the same time as the query result. Geographic locations could similarly be used to attach information to maps, e.g., assigning pictures from the user's last holiday to their travel itinerary. Lineage refers to the history of data transformations that produced a given data item. Users may be interested in obtaining previous versions of a given query result, e.g., to see what a project proposal looked like one month ago. Furthermore, it may also be interesting to understand how a data item was created. If a data item a was copied from a data item b, then previous versions of b may be of interest. For example, an error in a project proposal a may have been caused by an error in the proposal's template b.

(d) Context Queries : providing context to enable browsing and further exploration of query results. A common pattern is to query the neighborhood (or context) of objects returned by a previous query. Processing these kinds of queries is challenging. One way to speed up such queries is to keep their results materialized in a special index structure.

(e) Query "second level content" : searching second level content (e.g. notes or comments) related to original data items.
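Query by association, lineage, and context queries all amount to following links between data items. A minimal in-memory sketch is given below, assuming that associations are stored as labelled undirected edges and that each copy operation records its original. The `AssociationGraph` class and its method names are hypothetical, and the sketch recomputes neighborhoods on every call rather than materializing them in an index.

```python
from __future__ import annotations

from collections import defaultdict, deque


class AssociationGraph:
    """Hypothetical store of associations and copy lineage between items."""

    def __init__(self) -> None:
        self._edges: dict[str, set[tuple[str, str]]] = defaultdict(set)
        self._copied_from: dict[str, str] = {}

    def associate(self, a: str, label: str, b: str) -> None:
        """Record a labelled association, navigable in both directions."""
        self._edges[a].add((label, b))
        self._edges[b].add((label, a))

    def neighborhood(self, item: str, max_hops: int = 1) -> dict[str, int]:
        """Breadth-first search: map each item reachable within max_hops
        to its hop distance, excluding the start item itself."""
        distance = {item: 0}
        queue = deque([item])
        while queue:
            current = queue.popleft()
            if distance[current] == max_hops:
                continue  # do not expand beyond the hop limit
            for _, other in self._edges.get(current, ()):
                if other not in distance:
                    distance[other] = distance[current] + 1
                    queue.append(other)
        del distance[item]
        return distance

    def record_copy(self, new_item: str, original: str) -> None:
        """Remember that new_item was copied from original."""
        self._copied_from[new_item] = original

    def lineage(self, item: str) -> list[str]:
        """Follow copy links back to the earliest known ancestor."""
        chain: list[str] = []
        seen = {item}
        while item in self._copied_from:
            item = self._copied_from[item]
            if item in seen:
                break  # guard against accidental cycles
            seen.add(item)
            chain.append(item)
        return chain
```

With this structure, a search for the PIM paper can return associated people and conferences through `neighborhood`, while `lineage` traces a project proposal back to the template it was copied from. Caching the result of `neighborhood` per item would be one simple form of the materialization mentioned in (d).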

Topic revision: r2 - 09 May 2008 - 13:19:29 - Main.Admininistrator
 