Storage and Indexing in PDSMS

Storage and indexing in a PDSMS have the following goals:

  • to create efficiently queryable associations between data objects in different data sources,
  • to improve accesses to data sources that have limited access patterns,
  • to enable answering certain queries without accessing the actual data source, and
  • to support high availability and recovery.

The key challenges in building the local store and index of a PDSMS stem from heterogeneity: the index must adapt to widely differing environments. It should uniformly index every kind of data item, whether a word appearing in text, a value appearing in a database, or a schema element or XML tag in one of the sources. It takes as input any token appearing in the personal dataspace and returns the locations at which the token appears, along with the role of each occurrence. When a token appears in multiple data sources, the index can relate information across those sources (in a sense, a generalization of a join index). Typically, we may want to build special indexes for this purpose for a certain set of tokens. In addition, the index needs to account for multiple ways of referring to the same real-world object, e.g., different ways of referring to a company or a person. (Note that research on reference reconciliation has so far focused on detecting when multiple references are about the same object.) Keeping the index up to date will be tricky, especially for data sources that have no mechanism for notifying it of updates.
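To make the token-to-occurrences idea concrete, here is a minimal sketch of such an index in Python. It is only an illustration under stated assumptions, not the PDSMS design: the names `Occurrence`, `UniformIndex`, `add_alias`, and `sources_sharing`, the role labels, and the source identifiers are all hypothetical, and reference reconciliation is reduced to a hand-maintained alias table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Occurrence:
    source: str    # hypothetical data source identifier, e.g. "mail"
    location: str  # where in that source the token appears
    role: str      # e.g. "text_word", "db_value", "schema_element", "xml_tag"


class UniformIndex:
    """Maps any token to its occurrences, regardless of the kind of source."""

    def __init__(self):
        self._postings = defaultdict(list)  # canonical token -> [Occurrence]
        self._canonical = {}                # alias -> canonical token

    def _key(self, token):
        # Tokens are case-folded, then mapped through the alias table.
        token = token.lower()
        return self._canonical.get(token, token)

    def add_alias(self, alias, canonical):
        # Assumption: aliases are supplied by hand; a real system would
        # obtain them from a reference reconciliation step.
        alias, canonical = alias.lower(), self._key(canonical)
        self._canonical[alias] = canonical
        # Move postings already stored under the alias to the canonical key.
        if alias != canonical and alias in self._postings:
            self._postings[canonical].extend(self._postings.pop(alias))

    def add(self, token, source, location, role):
        self._postings[self._key(token)].append(
            Occurrence(source, location, role)
        )

    def lookup(self, token):
        # Returns the locations and roles of every occurrence of the token.
        return list(self._postings.get(self._key(token), []))

    def sources_sharing(self, token):
        # Sources in which the token appears: the join-index-like view.
        return {occ.source for occ in self.lookup(token)}


index = UniformIndex()
index.add_alias("Intl. Business Machines", "IBM")
index.add("IBM", "mail", "inbox/msg-17", "text_word")
index.add("Intl. Business Machines", "crm_db", "customers.name, row 4", "db_value")
index.add("customers", "crm_db", "table customers", "schema_element")

for occ in index.lookup("ibm"):
    print(occ.source, occ.location, occ.role)
```

A lookup of either spelling of the company returns both occurrences, and `sources_sharing` shows that the token links the mail store with the database. This sketch does not address how the index is kept up to date.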

Furthermore, deciding which portions to cache in the local store and which indexes to build raises several interesting challenges in automated tuning. We may want to cache certain personal dataspace fragments (vertical or horizontal) for several purposes, including:

  • to build additional indexes on them for supporting more efficient access,
  • to increase availability of data that is stored in data sources that may not be reliable, and
  • to reduce the query load on data sources that do not allow ad-hoc external queries.
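
As a rough illustration of caching fragments in the local store, the following Python sketch serves repeated reads from a local copy and detects changes by polling, which is one possible approach for sources that cannot send update notifications. Everything here is an assumption made for the example: the `FragmentCache` class, the `fetch` callback, the fragment identifiers, and the use of a SHA-256 fingerprint to detect changes. Deciding which fragments to cache, the tuning problem itself, is not covered.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CachedFragment:
    data: bytes
    fingerprint: str  # hash of the data, used to detect changes on refresh


class FragmentCache:
    """Keeps local copies of fragments fetched from a data source."""

    def __init__(self, fetch: Callable[[str], bytes]):
        # `fetch` is a hypothetical callback that reads one fragment
        # from the underlying data source.
        self._fetch = fetch
        self._store: Dict[str, CachedFragment] = {}

    def get(self, fragment_id: str) -> bytes:
        # Only the first read touches the source; later reads are local,
        # which reduces load on the source and keeps data available.
        if fragment_id not in self._store:
            self.refresh(fragment_id)
        return self._store[fragment_id].data

    def refresh(self, fragment_id: str) -> bool:
        # Poll the source; return True if the cached copy was (re)written.
        data = self._fetch(fragment_id)
        fingerprint = hashlib.sha256(data).hexdigest()
        cached = self._store.get(fragment_id)
        if cached is not None and cached.fingerprint == fingerprint:
            return False
        self._store[fragment_id] = CachedFragment(data, fingerprint)
        return True
```

A caller would invoke `refresh` periodically for each cached fragment and rebuild any indexes over that fragment whenever it returns `True`.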
Topic revision: r1 - 12 Dec 2007 - 05:47:56 - Main.Jidong
 