Monday, March 23, 2009

How MrTaggy is implemented...

A short time ago, we announced the MrTaggy browsing and searching engine for social bookmarks here. One of the neat features of this system is its relevance feedback mechanism which enables users to click on keywords to navigate toward the information that they are interested in.

The overall system uses a sophisticated MapReduce computation in the backend, and the implementation is non-trivial. Here is how it works. The diagram below was recently published in an IEEE Computer Magazine article, and it roughly describes how the data flows through the whole system.

First, a crawling module goes out to the web and crawls social tagging sites, looking for tuples that record which tags users have applied to which URLs. These tuples are stored in a MySQL database. In our current system, we have roughly 150 million tuples.
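To make the storage step concrete, here is a minimal sketch of what the tuple store might look like. The actual MrTaggy schema is not described in the post, so the table and column names are assumptions, and SQLite stands in for MySQL:

```python
import sqlite3

# Hypothetical schema for the crawled bookmarking tuples; the real
# MrTaggy tables are not documented here, so these names are assumptions.
conn = sqlite3.connect(":memory:")  # stand-in for the MySQL store
conn.execute(
    """CREATE TABLE bookmark (
           user TEXT,
           url  TEXT,
           tag  TEXT
       )"""
)

# A crawler would insert one row per observed (user, URL, tag) tuple.
rows = [
    ("alice", "http://example.com/a", "python"),
    ("bob",   "http://example.com/a", "search"),
    ("bob",   "http://example.com/b", "python"),
]
conn.executemany("INSERT INTO bookmark VALUES (?, ?, ?)", rows)

# A typical backend query: all URLs that carry a given tag.
urls = [u for (u,) in conn.execute(
    "SELECT DISTINCT url FROM bookmark WHERE tag = ?", ("python",))]
```

The backend computation described below consumes exactly these rows to build the URL-tag graph.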

A MapReduce system based on Bayesian inference and spreading activation then computes the probability of each URL or tag being relevant given a particular combination of other tags and URLs. Here we first construct a bigraph between URLs and tags based on the tuples and then precompute spreading activation patterns across the graph.
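As a toy illustration of the spreading activation step, the sketch below pushes activation outward from a seed tag across a small URL-tag bigraph. The decay constant, step count, and normalization are illustrative choices, not the production algorithm:

```python
from collections import defaultdict

def spread_activation(edges, seeds, decay=0.5, steps=2):
    """Simplified spreading activation over a URL-tag bigraph.

    edges: iterable of (url, tag) pairs; seeds: {node: initial activation}.
    Each step, every node's activation flows to its neighbors,
    attenuated by `decay` and split evenly among them.
    """
    neighbors = defaultdict(set)
    for url, tag in edges:
        neighbors[url].add(tag)
        neighbors[tag].add(url)

    activation = dict(seeds)
    for _ in range(steps):
        pulse = defaultdict(float)
        for node, value in activation.items():
            for nbr in neighbors[node]:
                pulse[nbr] += decay * value / len(neighbors[node])
        for node, value in pulse.items():
            activation[node] = activation.get(node, 0.0) + value
    return activation

# Seed the tag "python" and see which URLs (and related tags) light up.
edges = [("u1", "python"), ("u1", "search"), ("u2", "python")]
act = spread_activation(edges, {"python": 1.0})
```

After two steps, both URLs tagged "python" receive activation, and the co-occurring tag "search" receives a smaller amount via u1, which is the intuition behind using the pattern for relevance feedback.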

To do this backend computation in a massively parallel way, we used the MapReduce framework provided by Hadoop (hadoop.apache.org). The results of this computation are stored in a Lucene index so that we can make the retrieval of spreading activation patterns as fast as possible.
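The shape of such a MapReduce job can be sketched in a Hadoop Streaming style: a mapper that emits (tag, URL) pairs from the crawled tuples, and a reducer that aggregates per-tag URL counts. The actual MrTaggy job computes relevance probabilities, which this simple counting sketch does not attempt; on a cluster, each half would be a separate script reading stdin:

```python
from itertools import groupby
from operator import itemgetter

def mapper(records):
    # Emit one (tag, url) pair per crawled (user, url, tag) tuple.
    for user, url, tag in records:
        yield tag, url

def reducer(pairs):
    # Hadoop sorts by key between phases; sorted() simulates that here.
    for tag, group in groupby(sorted(pairs), key=itemgetter(0)):
        counts = {}
        for _, url in group:
            counts[url] = counts.get(url, 0) + 1
        yield tag, counts

records = [
    ("alice", "http://example.com/a", "python"),
    ("bob",   "http://example.com/a", "python"),
    ("bob",   "http://example.com/b", "search"),
]
result = dict(reducer(mapper(records)))
```

The reducer's per-tag output is what would then be indexed (in MrTaggy's case, into Lucene) for fast retrieval at query time.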

Finally, a web server serves up the search results through an interactive frontend. The frontend responds to user interaction with relevance feedback arrows by communicating with the web server using AJAX techniques and animating the interface to an updated state.
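The server side of that AJAX exchange can be sketched as a minimal WSGI handler. The endpoint shape and parameter names below are assumptions, since the post does not describe the actual API; the frontend would send the tags the user voted up or down and receive re-ranked results as JSON:

```python
import json
from urllib.parse import parse_qs

def app(environ, start_response):
    # Hypothetical relevance-feedback endpoint (parameter names assumed).
    params = parse_qs(environ.get("QUERY_STRING", ""))
    liked = params.get("like", [])        # tags voted relevant
    disliked = params.get("dislike", [])  # tags voted irrelevant
    # A real server would look up precomputed spreading activation
    # patterns in the Lucene index; here we echo a stub ranking whose
    # score just reflects the net feedback.
    results = [{"url": "http://example.com/a",
                "score": len(liked) - len(disliked)}]
    body = json.dumps({"results": results}).encode()
    start_response("200 OK", [("Content-Type", "application/json")])
    return [body]
```

The frontend's relevance feedback arrows would issue such a request on each click and animate the result list to the new ranking.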

Ed H. Chi, "Information Seeking Can Be Social," IEEE Computer, vol. 42, no. 3, pp. 42-46, March, 2009.

Thursday, March 12, 2009

Information Seeking can be Social: the potential for Social Search

As part of information seeking, exploratory search involves ill-structured problems and more open-ended goals, with persistent, opportunistic, iterative, multi-faceted processes aimed more at learning than answering a specific query [Marchionini 2006]. Whereas for fact-retrieval searches an optimal path to the document(s) containing the required information is crucial, learning and investigation activities lead to a more continuous and exploratory process, with the knowledge acquired during this “journey” being essential as well [White et al. 2007]. Therefore, information seeking systems should focus on providing cues that might make these explorations more efficient.

One possible solution is to build social information seeking systems, in which a social search system utilizes social cues provided by a large number of other people. What is social search? How might we build social search systems? Is there a need for such solutions?

Researchers and practitioners now use the term “social search” to describe search systems in which social interactions or information from social sources is engaged in some way [Evans and Chi 2008]. Current social search systems can be categorized into two general classes:

(a) Social answering systems utilize people with expertise or opinions to answer particular questions in a domain. Answerers could come from various levels of social proximity, including close friends and coworkers as well as the greater public. Yahoo! Answers is one example of such systems. Early academic research includes Ackerman’s Answer Garden [Ackerman, 1996], and recent startups include Mechanical Zoo’s Aardvark and ChaCha’s mobile search.

Some systems utilize social networks to find friends or friends of friends to provide answers. Web users also use discussion forums, IM chat systems, or their favorite social networking systems like Facebook and Friendfeed to ask their social network for answers that are hard to find using traditional keyword-based systems. These systems differ in terms of their immediacy, size of the network, as well as support for expert finding.
Importantly, the effectiveness of these systems depends on how efficiently they utilize search and recommendation algorithms to return the most relevant past answers, enabling better construction of the knowledge base.
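One simple way to retrieve relevant past answers is to score archived questions against a new one by cosine similarity over word counts. This is a generic illustrative sketch (a production system would use a proper tf-idf index and stemming), and the sample questions are invented:

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two texts as bags of words.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical archive of past question-answer pairs.
archive = {
    "how do I parse JSON in Python": "use the json module",
    "what is the capital of France": "Paris",
}
query = "parse JSON with Python"
best = max(archive, key=lambda q: cosine(q, query))
```

A new question is routed to the most similar past question, so its stored answer can be reused instead of waiting on a human answerer.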

(b) Social feedback systems utilize social attention data to rank search results or information items. Feedback from users could be obtained either implicitly or explicitly. For example, social attention data could come from usage logs implicitly, or systems could explicitly ask users for votes, tags, and bookmarks. One early example, from around 2001, used click data on search results to inform search ranking; the click data was gathered implicitly through the usage log. Others, like Wikia Search and most recently Google, are allowing users to explicitly vote for search results to directly influence the search rankings.

Vote-based systems have recently become more and more popular. Google’s original ranking algorithm, PageRank, could also be classified as an implicit voting system, as it essentially treats a hyperlink as a vote for the linked content. Social bookmarking systems such as Delicious allow users to search the entire database for websites that match particular popular tags.
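A recurring design question in vote-based ranking is how to keep items with only a handful of votes from outranking well-established ones. One common remedy, sketched below with invented data, is to smooth the upvote fraction toward a neutral prior (this is a generic technique, not any particular engine's formula):

```python
def vote_score(up, down, prior_weight=2):
    # Laplace-style smoothing: pull sparsely voted items toward 0.5
    # so one lucky upvote does not beat a long, mostly positive record.
    return (up + prior_weight * 0.5) / (up + down + prior_weight)

results = [
    ("http://example.com/a", 1, 0),    # a single upvote
    ("http://example.com/b", 80, 20),  # heavily voted, mostly positive
]
ranked = sorted(results, key=lambda r: vote_score(r[1], r[2]),
                reverse=True)
```

With raw fractions, the single-upvote item would score a perfect 1.0 and win; with smoothing, the heavily voted item ranks first.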

One problem with social cues is that the feedback given by people is inherently noisy. Finding patterns within such data becomes more and more difficult as the data size grows [Chi and Mytkowicz, 2008].
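The cited analysis applies information theory to tagging data. As a toy illustration of the underlying idea, one can compute the Shannon entropy of a tag's distribution over documents: a tag used consistently concentrates on a few documents (low entropy), while a noisy, overused tag spreads thinly across many (high entropy) and narrows the search down less. The example counts are invented:

```python
import math
from collections import Counter

def entropy(counts):
    # Shannon entropy (in bits) of a distribution given by raw counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

focused = Counter({"doc1": 9, "doc2": 1})           # consistent tag
noisy = Counter({f"doc{i}": 1 for i in range(10)})  # tag spread thinly
```

The noisy tag reaches the maximum entropy of log2(10) ≈ 3.32 bits, while the focused tag stays well below it, which is one way to quantify how informative a tag remains as the data grows.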

In both classes, there remains opportunity to apply more sophisticated statistical and structure-based analytics to improve the search experience for social searchers. For example, expertise-finding algorithms could be applied to help find answerers who can provide higher-quality answers in social answering systems. Common patterns between question-and-answer pairs could be exploited to derive semantic relationships that could be used to make inferences in question answering systems. Data mining algorithms could construct ontologies that are useful for browsing through the tags and bookmarked documents.

1. Ackerman, M. S.; McDonald, D. W. Answer Garden 2: merging organizational memory with collaborative help. In Proceedings of the 1996 ACM Conference on Computer Supported Cooperative Work (CSCW '96); Boston, MA; November 16-20, 1996. ACM, New York, NY; 97-105.
2. Chi, E. H.; Mytkowicz, T. Understanding the efficiency of social tagging systems using information theory. In Proceedings of the 19th ACM Conference on Hypertext and Hypermedia; Pittsburgh, PA; June 19-21, 2008. ACM, New York, NY; 81-88.
3. Evans, B.; Chi, E. H. Towards a Model of Understanding Social Search. In Proceedings of Computer-Supported Cooperative Work (CSCW 2008); San Diego, CA. ACM Press, 2008.
4. Marchionini, G. Exploratory search: From finding to understanding. Communications of the ACM 49, 4 (2006), 41-46.
5. White, R. W.; Drucker, S. M.; Marchionini, G.; Hearst, M.; schraefel, m. c. Exploratory search and HCI: designing and evaluating interfaces to support exploratory search interaction. In CHI 2007 Extended Abstracts; ACM Press (2007), 2877-2880.