Monday, March 23, 2009

How MrTaggy is implemented...

A short time ago, we announced the MrTaggy browsing and searching engine for social bookmarks here. One of the neat features of this system is its relevance feedback mechanism which enables users to click on keywords to navigate toward the information that they are interested in.

The overall system uses a sophisticated MapReduce computation in the backend, and the implementation is non-trivial. Here is how it works. The diagram below was recently published in an IEEE Computer Magazine article, and it roughly describes how the data flows thru the whole system. (Click on it to enlarge it.)



First, a crawling module goes out to the web and crawls social tagging sites, looking for tuples of the form . Tuples are stored in a MySQL database. In our current system, we have roughly 150 million tuples.

A MapReduce system based on Bayesian inference and spreading activation then computes the probability of each URL or tag being relevant given a particular combination of other tags and URLs. Here we first construct a bigraph between URLs and tags based on the tuples and then precompute spreading activation patterns across the graph.

To do this backend computation in massively parallel way, we used the MapReduce framework provided by Hadoop (hadoop.apache org). The results of this computation are stored in a Lucene index so that we can make the retrieval of spreading activation patterns as fast as possible.

Finally, a web server serves up the search results through an interactive frontend. The frontend responds to user interaction with relevance feedback arrows by communicating with the web server using AJAX techniques and animating the interface to an updated state.

Reference:
Ed H. Chi, "Information Seeking Can Be Social," IEEE Computer, vol. 42, no. 3, pp. 42-46, March, 2009.

No comments: