Preamble
We develop a semantic desktop search prototype that enhances conventional full-text search with semantics and ranking modules. The prototype extracts activity-based metadata and stores it explicitly as RDF annotations. Our main contributions are extensions that we integrate into an open source desktop search engine to exploit this additional contextual information for searching and ranking the resources on the desktop. Contextual information plus ranking brings desktop search much closer to the performance of web search engines. Initially disconnected sets of resources on the desktop are connected by our contextual metadata, and PageRank-derived algorithms allow us to rank these resources appropriately.
Project Overview
Existing desktop search applications, trying to keep up with the rapidly increasing storage capacities of our hard disks, offer an incomplete solution for information retrieval. Desktop search could profit from the large amount of implicit and explicit semantic information available in emails, folder hierarchies, browser cache contexts and other sources. This project investigates how to extract and store this activity-based context information explicitly as RDF metadata and how to use it, together with additional background information and ontologies, to enhance desktop search. The goal is to improve searching and querying of semantically associated personal information by extracting and analyzing the user's activity-based metadata.
A semantic desktop search system enhances and contextualizes desktop search based on semantic metadata collected from the different contexts available and from the activities performed when the user interacts with information. There are three main working contexts: email exchanges, file operations (create, modify, etc.), and web surfing. Context metadata can be generated automatically by activity event monitors while the user works. For example, when an email is received, the monitor automatically generates email RDF metadata, instantiating the sender, the subject and the valuable comments inside its body, and associating them with the documents attached to this email. The system also annotates each cached web page with additional information, both for its basic properties (URL, access date, etc.) and for more complex ones such as the incoming and outgoing links the user followed to and from neighboring pages, reflecting the user's surfing behavior. All of this metadata is exported in RDF format through desktop ontologies and added to a metadata index, which is used by the search application together with the usual full-text index.
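The email example above can be sketched as follows. This is a minimal illustration and not the project's actual monitor: the namespace `NS`, the property names (`sender`, `subject`, `attachedTo`) and the function `email_to_triples` are all hypothetical, since the text does not list the ontology's real URIs. The sketch only uses Python's standard `email` package and returns plain (subject, predicate, object) triples that could then be written out in any RDF serialization.

```python
from email import message_from_string

# Hypothetical namespace and property names; the real desktop ontology
# URIs are not given in the text.
NS = "http://example.org/desktop#"


def email_to_triples(raw, msg_uri, saved_paths):
    """Describe one received email as (subject, predicate, object) triples.

    raw         -- the raw RFC 822 message text
    msg_uri     -- identifier chosen for this email resource
    saved_paths -- maps attachment file names to the URIs of the files
                   they were saved as on the desktop
    """
    msg = message_from_string(raw)
    triples = [
        (msg_uri, NS + "type", NS + "Email"),
        (msg_uri, NS + "sender", msg.get("From", "")),
        (msg_uri, NS + "subject", msg.get("Subject", "")),
    ]
    # Associate every saved attachment with the email it arrived in, so the
    # file keeps its email context after being stored on disk.
    for part in msg.walk():
        name = part.get_filename()
        if name and name in saved_paths:
            triples.append((saved_paths[name], NS + "attachedTo", msg_uri))
    return triples
```

With triples of this shape in the metadata index, a stored attachment can be found through the sender or subject of the email it came from, not only through its own content.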
Project Background
The capacity of our hard-disk drives has increased tremendously over the past decade, and so has the number of files we usually store on our computer. Using this space, it is quite common to have over 100,000 indexable items on the desktop. It is no wonder that sometimes we cannot find a document anymore, even when we know we saved it somewhere. Ironically, in some of these cases nowadays, the document we are looking for can be found faster on the World Wide Web than on our personal computer. In view of these trends, resource organization in personal repositories has received more and more attention during the past years.
Desktop search falls short of utilizing desktop specific characteristics, especially context information. Some of these missed opportunities include:
- Email context is not utilized by the existing search algorithms, even though this clearly drops useful information. For example, one email might contain a question describing the object one is looking for, and another email in the same thread might include the answer to that question in the form of an attached document.
- Email attachments lose all contextual information as soon as they are stored on the PC, even though emails usually include additional information about their attachments, such as sender, subject, comments. We might discuss a paper with a colleague during a brainstorming session, and then afterwards send her the electronic version via email, together with a few helpful comments. After a while, our colleague might not remember details about the paper itself, but rather recall with whom she discussed it or which question was raised in the discussion and included as comment in the email. It would be helpful to find the stored paper not only based on its content, but also associatively based on that context.
- Folder hierarchies are barely utilized by the search algorithms, even though we might have spent considerable time to build sophisticated classification hierarchies for the documents we store. For example, pictures taken in Hanover are probably stored in a directory entitled “Germany”, “Lower Saxony” or “Hanover”, and it would be nice if we could utilize this information when we search for the pictures.
- Browser caches include all information about the user’s browsing behavior, which is useful both for finding relevant results (for example, if we remember how to find the project’s home page, but not the corresponding API specification), and for providing additional context for results. It would also be very useful if our search application returned not only one specific scientific paper we downloaded from the CiteSeer repository, but also all the referenced and referring papers which we downloaded on that occasion.
As studies have shown that people tend to associate things to certain contexts, all this information should be utilized during search. So far, however, neither has this information been collected, nor have there been attempts to use it.
In this project we discuss how to enhance and contextualize desktop search based on semantic metadata collected from different contexts available and activities performed on a personal computer. We explore three important contexts: electronic mail, folder hierarchies, and web cache. Analogously, other contexts might be exploited as well. We describe the semantics of these different contexts by appropriate ontologies and show how to extract and represent the corresponding context information as RDF metadata which can be used by a search application together with a full text index of our documents.
Comparing the possibilities for a semantic desktop search environment to semantic search on the web, we believe that semantic web technologies might ultimately be more important on the desktop than on the web. This is because, first, our desktop environment is “limited” in the sense that we will be able to describe most relevant contexts rather easily, and thus will be able to provide more complete ontologies / metadata specifications for the desktop environment than for the web in general. Second, even with 200GB hard disks in our computers, the amount of data and metadata itself is limited compared to the information available on the web, so more sophisticated algorithms for using semantic annotations are feasible on the desktop than on the web.
Architecture

Figure 1: Architecture of Semantic Desktop Search System
Research Content
Desktop links analysis from user activities
Web search has become more efficient than PC search thanks to powerful link-based ranking solutions like PageRank. The recent arrival of desktop search applications, which index all data on a PC, promises to increase search efficiency on the desktop. However, even with these tools, searching through our (relatively small set of) personal documents is currently inferior to searching the (rather vast set of) documents on the web. Indeed, desktop search engines are now comparable to first generation web search engines, which provided full-text indexing, but only relied on textual information retrieval algorithms to rank their results.
Desktop ranking is hindered by the lack of links between documents, an important source of evidence for current web ranking algorithms. To alleviate this deficiency we propose to connect semantically related desktop items by analyzing the user’s activity patterns, as well as her local resource organization structures. We investigate and evaluate in detail the possibilities to translate this information into a desktop linkage structure, and we propose several algorithms that exploit these newly created links in order to efficiently rank desktop items. The access-based links lead to ranking results comparable with TFIDF ranking, and significantly surpass TFIDF when used in combination with it, making them a very valuable source of input to desktop search ranking algorithms.
We create links between desktop resources when some specific desktop usage activity is encountered (e.g., the attachment of an email is saved as a file, or a web page is stored locally). There exists a plethora of other cues for inferring desktop links, most of them unexplored by previous work. We distinguish four kinds of explicit and implicit links between resources according to the user’s activity context:
- Same-task relation:
  - Task: a series of user activities with a specific goal
  - Work-oriented task: doc, ppt, xsl, pdf, web pages visited in IE, emails
  - Methods to find tasks:
    - Central document and active time
    - Clustering documents by time intervals
- Save-as relation (e.g., the attachment of an email is saved as a file, or a web page is stored locally):
  - Web pages -> files
  - Email attachments -> files
  - Doc/PPT <-> PDF
  - Same types of files
- Similar-to relation:
  - Based on facets
  - Based on contents
  - Based on file names
  - Based on locations (same directory)
- Copy-from relation:
  - File copy
  - Content copy (text or image)
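As an illustration of the "clustering documents by time intervals" method listed under the same-task relation, the sketch below groups access events into tasks whenever two consecutive events are close in time, and then links all resources within a task. The function names and the 300-second gap threshold are assumptions made for this example; the text does not specify how the project chooses its intervals.

```python
from itertools import combinations


def cluster_by_time(events, max_gap=300.0):
    """Group (timestamp, resource) access events into tasks.

    A new task starts whenever the gap between two consecutive events
    exceeds max_gap seconds (an assumed threshold, not from the project).
    """
    tasks = []
    last_ts = None
    for ts, resource in sorted(events):
        if last_ts is None or ts - last_ts > max_gap:
            tasks.append([])
        if resource not in tasks[-1]:
            tasks[-1].append(resource)
        last_ts = ts
    return tasks


def same_task_links(tasks):
    """Create a bidirectional link between every pair of resources in a task."""
    links = set()
    for task in tasks:
        for a, b in combinations(task, 2):
            links.add((a, b))
            links.add((b, a))
    return links


events = [(0, "a.doc"), (60, "b.pdf"), (1000, "c.ppt"), (1100, "a.doc")]
tasks = cluster_by_time(events)
links = same_task_links(tasks)
```

Here `a.doc` and `b.pdf` end up in one task and `c.ppt` and `a.doc` in another, so `b.pdf` and `c.ppt` are not linked directly.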
For example, files stored within the same directory have, to some extent, something in common, especially for filers, i.e., users who organize their personal data into carefully selected hierarchies. Similarly, files having the same file name (ignoring the path) are in many cases semantically related. In this case, however, the name should not consist exclusively of stopwords. Moreover, for this second heuristic we had to use an extended stopword list, which also includes several very common file name words, such as “index” or “readme”. In total, we appended 48 such words to the original list.
Finally, we note that both of the above approaches favor smaller sets: if all files within such a set (e.g., all files residing in the same directory) are linked to each other, then the stationary probability of the Markov chain associated with this desktop linkage graph is higher for the files residing in a smaller set. This is in fact desirable, since, for example, a directory storing 10 items has most probably been created manually and thus contains files that are to some extent related, whereas a directory storing 1,000 items has in most cases been generated automatically. Also, since these sub-graphs of the main desktop graph are cliques, several computational optimizations are possible; however, in order to keep our algorithms clear we will not discuss them here.
Another source of linkage information is file type. There is clearly a connection between resources sharing the same type, even though it is a very weak one. Unfortunately, each such category will nowadays contain up to several thousand items (e.g., JPG images), which makes this heuristic difficult to integrate into the ranking scheme. A more reliable approach is to use text similarity to generate links between very similar desktop resources.
Likewise, if the same entity appears in several desktop resources (e.g., Hannover appears both as the name of a folder with pictures and as the subject of an email), then we argue that some kind of a semantic connection exists between the two resources. Finally, we note that users should be allowed to manually create links as well, possibly having a much higher weight associated to these special links.
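The directory and file-name heuristics described above can be sketched as follows. This is only an illustration under stated assumptions: the stopword set is a tiny stand-in (the text mentions 48 extra file-name words but names only “index” and “readme”), the tokenization rule is invented for the example, and `heuristic_links` is a hypothetical helper, not the project's code.

```python
import re
from collections import defaultdict
from itertools import combinations
from pathlib import PurePosixPath

# Illustrative stopword list only; the project's extended list is not
# reproduced in the text.
STOPWORDS = {"the", "a", "of", "and", "index", "readme"}


def name_tokens(path):
    """Split a file name (without extension) into lowercase word tokens."""
    stem = PurePosixPath(path).stem.lower()
    return [t for t in re.split(r"[^a-z0-9]+", stem) if t]


def heuristic_links(paths):
    """Link files that share a directory or a non-stopword file name."""
    by_dir = defaultdict(list)
    by_name = defaultdict(list)
    for p in paths:
        pp = PurePosixPath(p)
        by_dir[str(pp.parent)].append(p)
        tokens = name_tokens(p)
        # Skip names consisting exclusively of stopwords, e.g. index.html.
        if tokens and not all(t in STOPWORDS for t in tokens):
            by_name[pp.name.lower()].append(p)
    links = set()
    # Every group forms a clique: all members are linked to each other.
    for group in list(by_dir.values()) + list(by_name.values()):
        for a, b in combinations(group, 2):
            links.add((a, b))
            links.add((b, a))
    return links


paths = [
    "/docs/a/report.pdf",
    "/docs/a/notes.txt",
    "/docs/b/report.pdf",
    "/docs/c/index.html",
    "/docs/d/index.html",
]
links = heuristic_links(paths)
```

Because each group is linked as a clique, a file in a small directory receives a larger share of its neighbors' rank than a file in a very large one, which matches the preference for smaller sets noted above.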
Ontology

Figure 2: Ontology of Semantic Desktop Search System
Ranking
Existing desktop search engines are now comparable to first
generation web search engines, which provided full-text indexing,
but only relied on textual information retrieval algorithms to
rank their results. However, it is hard to use term frequency
scores to describe the importance of a document. Therefore, with
poor ranking, results could not reflect the user's personal
preference.
Desktop ranking is hindered by the lack of links between
documents, an important source of evidence for current web ranking
algorithms. To alleviate this deficiency we reconstruct
semantically associative links by analyzing user's activity
patterns and contexts. Then a new ranking scheme is formed based
on such link structures. The access based links lead to ranking
results comparable with TFIDF ranking, and significantly surpass
TFIDF when used in combination with it, making them a very
valuable source of input to desktop search ranking algorithms.
Based on the link structure of the resources, the ranking scheme
combines TFIDF with the PageRank algorithm, while also
considering user preference through access frequency. The
computation of the ranking is similar to the ObjectRank method used
in the context of keyword search in traditional databases, but is
based on the desktop ontologies presented before. The two most
important concepts of the ObjectRank algorithm are the
Authority Transfer Schema Graph (ATSG) and the
Authority Transfer Data Graph.
The ATSG used in this system is shown in Figure 3.
Both graphs are simply edge-weighted directed graphs with a
backward edge (black edges in Figure 3) added for each forward
edge (red edges in Figure 3) in the original graph representing
relationships between different desktop data sources. The
*authority transfer schema graph* expresses how importance
propagates among the entities and resources inside the ontology.
Its weights and edges represent the authority transfer
annotations, which extend our context ontologies with the
information we need to compute ranks for all instances of the
classes defined in the context ontologies. The figure describes
the links among desktop resources we construct from user activity
context. A number of annotation ontologies are used to describe
the relationships among the resources and thus influence the
rankings. With all desktop resources linked to each other, the
ObjectRank computation becomes very simple. The computation is
based on the PageRank formula:
$r = d \cdot A \cdot r + (1 - d) \cdot e$
applying the random surfer model and including all nodes in the
base set. Here d is the damping factor, and the random jump to an
arbitrary resource from the data graph is modeled by the vector e.
A is the adjacency matrix that connects all available instances of
the existing context ontology on one's desktop. However, in order
to model personal preferences through user access frequency, we
modify e by assigning different weights to different data sources
according to their access frequency.
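The iteration behind this formula can be sketched with a simple power method. This is a minimal sketch and not the project's implementation: it assumes d = 0.85 (a common PageRank default, not stated in the text), normalizes each node's outgoing transfer weights so they sum to one, redistributes the rank of nodes without outgoing links through e, and biases e with per-resource access counts, whereas the text speaks of weights per data source. The function name `object_rank` and its parameters are hypothetical.

```python
def object_rank(nodes, edges, access_counts, d=0.85, iterations=50):
    """Iterate r = d * A * r + (1 - d) * e on a weighted link graph.

    nodes         -- list of resource identifiers
    edges         -- dict mapping (src, dst) to a non-negative transfer weight
    access_counts -- dict mapping a resource to how often it was accessed
    """
    # Build the jump vector e from access frequency (uniform if no data).
    total = sum(access_counts.get(n, 0) for n in nodes)
    if total > 0:
        e = {n: access_counts.get(n, 0) / total for n in nodes}
    else:
        e = {n: 1.0 / len(nodes) for n in nodes}

    # Sum of outgoing weights per node, used to normalize the columns of A.
    out_sum = {n: 0.0 for n in nodes}
    for (src, _dst), w in edges.items():
        out_sum[src] += w

    r = dict(e)
    for _ in range(iterations):
        nxt = {n: (1 - d) * e[n] for n in nodes}
        for (src, dst), w in edges.items():
            if w > 0:
                nxt[dst] += d * r[src] * w / out_sum[src]
        # Nodes without outgoing links pass their rank on through e,
        # so the scores keep summing to one.
        dangling = sum(r[n] for n in nodes if out_sum[n] == 0)
        for n in nodes:
            nxt[n] += d * dangling * e[n]
        r = nxt
    return r


nodes = ["a", "b", "c"]
edges = {("a", "b"): 1.0, ("b", "a"): 1.0, ("b", "c"): 1.0, ("c", "b"): 1.0}
ranks = object_rank(nodes, edges, {"a": 10, "b": 1, "c": 1})
```

In this small graph `a` and `c` are structurally symmetric, so the higher score of `a` comes entirely from the access-frequency bias in e.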
The new ranking scheme we developed benefits from the
advantages of both Lucene's TFIDF score and ObjectRank. The
new scores are computed as a combination of the two using the
following formula:
$R'(a) = R(a) \cdot TFIDF(a)$
where a represents the resource, R(a) is the computed
ObjectRank, TFIDF(a) is the TFIDF score for resource a and
R'(a) is the resulting score. The formula guarantees that the
highest ranked resources have both a high TFIDF and a high
ObjectRank score. The re-ranking is performed at query time.
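The query-time re-ranking can be illustrated with a few lines of code. The sketch assumes that the full-text engine returns a list of (resource, TFIDF score) pairs and that ObjectRank scores were precomputed and stored in a dictionary; the function name `rerank`, the data shapes and the sample scores are made up for this example.

```python
def rerank(hits, object_rank_scores):
    """Re-rank full-text hits by R'(a) = R(a) * TFIDF(a).

    hits               -- list of (resource, tfidf_score) pairs
    object_rank_scores -- dict mapping a resource to its ObjectRank R(a);
                          resources without a score are treated as 0
    """
    scored = [
        (resource, object_rank_scores.get(resource, 0.0) * tfidf)
        for resource, tfidf in hits
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


hits = [("a", 0.9), ("b", 0.5), ("c", 0.8)]
scores = {"a": 0.1, "b": 0.6, "c": 0.3}
ordered = [resource for resource, _ in rerank(hits, scores)]
```

Resource `a` has the best TFIDF score but a low ObjectRank, so it drops behind `b` and `c` after re-ranking.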

Figure 3: Authority Transfer Schema Graph
Project Implementation
PIM Implementation