This page summarizes a paper published in ACM SIGMOD Record in December 2005 by Michael Franklin (University of California, Berkeley), Alon Halevy (Google Inc. and U. Washington), and David Maier (Portland State University). The paper introduces a new abstraction, the dataspace, as a new agenda for information management. Here, we summarize the features of dataspaces and compare a dataspace management system with a database management system. The personal dataspace, as the main application of dataspaces, will become the focal point of personal information management.
A Database Management System (DBMS) is a generic repository for the storage and querying of structured data. A DBMS offers a suite of interrelated services and guarantees that enable developers to focus on the specific challenges of their applications, rather than on the recurring challenges involved in managing and accessing large amounts of data consistently and efficiently.
Unfortunately, in data management scenarios today it is rarely the case that all the data can be fit nicely into a conventional relational DBMS, or into any other single data model or system. Instead, developers are more often faced with a set of loosely connected data sources and thus must individually and repeatedly address low-level data management challenges across heterogeneous collections. These challenges include: providing search and query capability; enforcing rules, integrity constraints, naming conventions, etc.; tracking lineage; providing availability, recovery, and access control; and managing evolution of data and metadata.
Such challenges are ubiquitous: they arise in enterprises (large or small), within and across government agencies, in large science-related collaborations, in libraries (digital or otherwise), on battlefields, in smart homes, and even on one's PC desktop or other personal devices.
A Dataspace Management System (DSMS) offers a suite of interrelated services and guarantees that free developers from having to continually re-implement basic data management functionality when dealing with complex, diverse, interrelated data sources, much in the same way that traditional DBMSs provide such leverage over structured relational databases. Unlike a DBMS, however, a DSMS does not assume complete control over the data in the dataspace. Instead, a DSMS allows the data to be managed by the participant systems, but provides a new set of services over the aggregate of the systems, while remaining sensitive to their requirements for autonomy.
The table shows the differences between a DSMS and a DBMS.
{| border="1"
|-
| DSMS || DBMS
|-
| All data (structured & unstructured)|| Structured data
|-
| Loosely Coupled|| Full control of data
|-
| Open, Data Co-existence|| Closed, Centralized Control
|-
| Schema later|| Schema first
|-
| Pay-as-you-go (best-effort query)|| Accurate query
|-
| Data association and evolution|| Data stability
|-
| Context awareness|| N/A
|}
A DSMS must deal with data and applications in a wide variety of formats, accessible through many systems with different interfaces. A DSMS is required to support all the data in the dataspace, unstructured as well as structured, rather than leaving some out, as a DBMS does.
Although a DSMS offers an integrated means of searching, querying, updating, and administering the dataspace, often the same data may also be accessible and modifiable through an interface native to the system hosting the data. Thus, unlike a DBMS, a DSMS is not in full control of its data.
A DSMS is not a data integration approach; rather, it is more of a data co-existence approach. The goal of a DSMS is to provide base functionality over all data sources. For example, a DSMS can provide keyword search over all of its data sources, similar to that provided by existing desktop search systems. When more sophisticated operations are required, such as relational-style queries, data mining, or monitoring over certain sources, then additional effort can be applied to more closely integrate those sources in an incremental, “pay-as-you-go” fashion.
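The base keyword-search service described above can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `keyword_search` function, the idea that each participant is wrapped by an adapter yielding `(id, text)` pairs, and the toy `emails` and `files` data do not come from the paper.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical adapter type: each participant exposes its items as (id, text) pairs.
Source = Callable[[], Iterable[Tuple[str, str]]]


def keyword_search(sources: Dict[str, Source], keyword: str) -> List[Tuple[str, str]]:
    """Return (source name, item id) for every item whose text contains the keyword."""
    needle = keyword.lower()
    hits = []
    for name, read_items in sources.items():
        for item_id, text in read_items():
            if needle in text.lower():
                hits.append((name, item_id))
    return hits


# Two toy participants with different native formats; the adapters flatten them to text.
emails = [{"id": "m1", "subject": "Dataspace draft", "body": "see attached"}]
files = {"notes.txt": "meeting about sensors", "paper.txt": "A dataspace agenda"}

sources = {
    "mail": lambda: ((m["id"], m["subject"] + " " + m["body"]) for m in emails),
    "files": lambda: files.items(),
}

results = keyword_search(sources, "dataspace")
print(results)
```

No schema is agreed on up front: each source only needs a thin adapter. In a pay-as-you-go setting, closer integration (for example a schema mapping for relational-style queries) would be added later, and only for the sources that need it.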
Queries to a DSMS may offer varying levels of service, and in some cases may return best-effort or approximate answers. For example, when individual data sources are unavailable, a DSMS may be capable of producing the best results it can, using the data accessible to it at the time of the query.
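A best-effort answer of this kind might look like the following Python sketch. The `best_effort_query` function, the use of `ConnectionError` to signal an unavailable source, and the sample sources are all hypothetical; the paper does not prescribe an interface.

```python
from typing import Callable, Dict, List, Tuple


def best_effort_query(
    sources: Dict[str, Callable[[], List[str]]], keyword: str
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Query every reachable source; report unreachable ones instead of failing."""
    answers: List[Tuple[str, str]] = []
    skipped: List[str] = []
    for name, fetch in sources.items():
        try:
            rows = fetch()
        except ConnectionError:
            # Assumption: an unavailable participant signals this with ConnectionError.
            skipped.append(name)
            continue
        answers.extend((name, row) for row in rows if keyword in row)
    return answers, skipped


def offline() -> List[str]:
    raise ConnectionError("source unreachable")


sources = {
    "wiki": lambda: ["dataspace overview", "schema mapping"],
    "archive": offline,
}

answers, skipped = best_effort_query(sources, "dataspace")
print(answers, skipped)
```

Returning the list of skipped sources alongside the answers lets the caller see that the result is partial rather than mistaking it for a complete one.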
Unlike a DBMS, which focuses on data stability, a DSMS pays more attention to data association and evolution. It can help users query by association, returning not only objects whose attribute values contain the required keywords, but also objects that are strongly related to multiple such objects. A DSMS can also manage the evolution of data and metadata.
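Query by association can be sketched in a few lines of Python. The text does not define "strongly related", so this sketch assumes it means being linked to at least `min_links` directly matching objects; the function name, the undirected link list, and the sample data are likewise illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple


def query_by_association(
    objects: Dict[str, str],
    links: List[Tuple[str, str]],
    keyword: str,
    min_links: int = 2,
) -> Tuple[Set[str], Set[str]]:
    """Return (direct matches, objects linked to at least min_links direct matches)."""
    needle = keyword.lower()
    direct = {oid for oid, text in objects.items() if needle in text.lower()}

    # Count, for every non-matching object, how many matching objects it is linked to.
    counts: Dict[str, int] = {}
    for a, b in links:
        for src, dst in ((a, b), (b, a)):  # links are treated as undirected
            if src in direct and dst not in direct:
                counts[dst] = counts.get(dst, 0) + 1

    related = {oid for oid, n in counts.items() if n >= min_links}
    return direct, related


objects = {
    "paper1": "Dataspaces: a new abstraction",
    "paper2": "Principles of dataspace systems",
    "alice": "Author profile",
    "bob": "Reviewer profile",
}
links = [("paper1", "alice"), ("paper2", "alice"), ("paper2", "bob")]

direct, related = query_by_association(objects, links, "dataspace")
print(direct, related)
```

Here `alice` is returned even though her profile never mentions the keyword, because she is linked to both matching papers, while `bob` is linked to only one and is left out.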
A dataspace should contain all of the information relevant to a particular organization regardless of its format and location, and model a rich collection of relationships between data repositories. Hence, a DSMS models a dataspace as a set of participants and relationships. The participants in a dataspace are the individual data sources: they can be relational databases, XML repositories, text databases, web services, and software packages. Their data can be stored or streamed (managed locally by data stream systems), and a participant can even be a sensor deployment. A DSMS should be able to model any kind of relationship (e.g., schema mappings, replicas, containment relationships) between two (or more) participants. On the more traditional end, a DSMS should be able to model that one participant is a view or a replica of another, or to specify a schema mapping between two participants. Relationships may also be much less specific, such as that two datasets came from the same source at the same time.
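The participants-and-relationships model can be made concrete with a small Python sketch. The class names, field names, and relationship labels such as `"replica"` and `"schema_mapping"` are hypothetical choices for illustration, not an API defined by the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Participant:
    name: str
    kind: str  # e.g. "relational database", "XML repository", "data stream"


@dataclass(frozen=True)
class Relationship:
    kind: str                 # e.g. "replica", "view", "schema_mapping"
    members: Tuple[str, ...]  # names of two or more participants


@dataclass
class Dataspace:
    participants: Dict[str, Participant] = field(default_factory=dict)
    relationships: List[Relationship] = field(default_factory=list)

    def add_participant(self, name: str, kind: str) -> Participant:
        participant = Participant(name, kind)
        self.participants[name] = participant
        return participant

    def relate(self, kind: str, *members: str) -> Relationship:
        if len(members) < 2:
            raise ValueError("a relationship needs at least two participants")
        unknown = [m for m in members if m not in self.participants]
        if unknown:
            raise KeyError(f"unknown participants: {unknown}")
        relationship = Relationship(kind, tuple(members))
        self.relationships.append(relationship)
        return relationship

    def relationships_of(self, name: str) -> List[Relationship]:
        return [r for r in self.relationships if name in r.members]


space = Dataspace()
space.add_participant("orders_db", "relational database")
space.add_participant("orders_mirror", "relational database")
space.add_participant("catalog_xml", "XML repository")
space.add_participant("sensor_feed", "data stream")

space.relate("replica", "orders_mirror", "orders_db")
space.relate("schema_mapping", "orders_db", "catalog_xml")
space.relate("same_source_same_time", "catalog_xml", "sensor_feed")

kinds = [r.kind for r in space.relationships_of("orders_db")]
print(kinds)
```

Keeping the relationship kind as a free-form label, rather than a fixed enumeration, reflects the requirement that a DSMS be able to record both precise relationships (a replica, a schema mapping) and loose ones (two datasets that came from the same source at the same time).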