Live from ESTC 2008 – Semantic Search

Semantic search is one of the hot topics at the conference, though what the different players mean by the term varies. The Friday morning keynote is from Hugo Zaragoza of Yahoo! Research.

Currently Yahoo! has a platform for getting better and nicer results from your search engine. SearchMonkey is an open platform for using structured data to build more useful and relevant search results. It simply changes the snippets (the result view), adding deep links, images, name-value pairs, or abstracts drawn from the provider's data sources.

Looking forward at challenges, Hugo shows a Yahoo! search where every result is annotated with what it is (restaurant etc.). The results are placed on a map by location, with faceted filtering on extracted metadata. Some of his key quotes:

We move from a web of pages to a web of objects

Search is no longer about finding documents, but an interface for web mediated goals

Precision of navigational queries is solved

Document crawling and spam, indexing and retrieval, result relevance are not solved

How do you model intent?

– what is the right abstraction?

– what is the right granularity? / what ontology should we be using?

– what are the top intents?

And how do you measure relevance in the web of objects? What is the automated framework for relevance, and what are the ranking models that can attain it?

All these aspects are very relevant to what the semantic web tries to address.

Current research directions at Yahoo! Research in Barcelona cover MicroSearch, learning tags, and searching objects. MicroSearch comes from Peter Mika, a Yahoo! semantic web guru. This approach simply goes automatically to the pages of a result set, gets RDF (RDFa, GRDDL) or microformats, and presents it; e.g., when searching for Ivan Herman, the result is shown with events in a timeline and addresses on a map. An example of trouble with intent (funnily enough, an example that I often use myself) is a search for Paris Hilton. Should you model the hotel or the person?
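The "go to each result page and pull out the embedded structured data" step can be sketched roughly as below. This is only an illustration using Python's standard `html.parser`, not how MicroSearch itself is built: the `WANTED` class names, the `SAMPLE` page, and the function name `extract_microformats` are all assumptions made up for the example, and it handles only a tiny slice of what real microformat or RDFa parsing involves.

```python
from html.parser import HTMLParser

# Assumed subset of microformat class names to collect (hCard / hCalendar style).
WANTED = {"fn", "locality", "summary"}
# Void elements never get a closing tag, so they must not be tracked on the stack.
VOID = {"br", "img", "hr", "meta", "link", "input"}


class MicroformatExtractor(HTMLParser):
    """Collects (property, text) pairs for elements whose class is in WANTED."""

    def __init__(self):
        super().__init__()
        self.found = []
        self._stack = []  # one entry per open element: a property name or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        self._stack.append(next((c for c in classes if c in WANTED), None))

    def handle_endtag(self, tag):
        if tag not in VOID and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Attribute the text to the nearest enclosing element with a wanted class.
        prop = next((p for p in reversed(self._stack) if p), None)
        if text and prop:
            self.found.append((prop, text))


def extract_microformats(html):
    parser = MicroformatExtractor()
    parser.feed(html)
    return parser.found


# Made-up result page containing an hCard and an hCalendar event.
SAMPLE = """
<div class="vcard">
  <span class="fn">Jane Example</span>
  <span class="adr"><span class="locality">Example City</span></span>
</div>
<div class="vevent"><span class="summary">Example talk</span></div>
"""

if __name__ == "__main__":
    for prop, text in extract_microformats(SAMPLE):
        print(prop, "=", text)
```

A presentation layer could then route each property to a suitable widget, for instance localities to a map and event summaries to a timeline.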

A second research direction is “learning to tag, tagging to learn”. His example is how Wikipedia presents the same information both as free text (on the left side of a page) and as metadata (the info box on the right side). The idea here is to combine the NLP on the left and the RDF on the right to create more information. NLP often loses relations but is good on types. RDF is often good on relations but weak on types.
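To make the "types from NLP plus relations from RDF" combination concrete, here is a toy sketch. It assumes the NLP side has already produced a mapping from entity name to type and the info box side has already produced subject-predicate-object triples; the data, the `"Unknown"` fallback, and the function name `annotate_triples` are invented for illustration and say nothing about how Yahoo! Research actually does it.

```python
def annotate_triples(triples, entity_types):
    """Attach a text-derived type to the subject and object of each triple."""
    annotated = []
    for subj, pred, obj in triples:
        annotated.append((
            (subj, entity_types.get(subj, "Unknown")),
            pred,
            (obj, entity_types.get(obj, "Unknown")),
        ))
    return annotated


# Hypothetical relation taken from an info box (strong on relations).
INFOBOX_TRIPLES = [("Example Person", "bornIn", "Example City")]
# Hypothetical types recognised in the free text (strong on types).
TEXT_TYPES = {"Example Person": "Person", "Example City": "Place"}

if __name__ == "__main__":
    for triple in annotate_triples(INFOBOX_TRIPLES, TEXT_TYPES):
        print(triple)
```

The result is a triple in which both ends carry a type, which neither source provides on its own.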

A last topic of research is ranking objects, another topic I’ve been struggling with myself. The simple background is that you can rank results in web searches, whereas in the database world there is no ranking. Why is this hard? Again some quotes from Hugo:

Search (Information Retrieval) technology greatly surpassed Boolean queries in the 80s 

Attempts to improve search technology with semantic knowledge have repeatedly failed (except in a very narrow domain)

Effective query expansion is very difficult

“Entity ranking” relates to the sorting of entities by relevance to a query. Hugo explains what he calls colored (typed) indexes and entity containment graphs. My oversimplified explanation is that the search looks for simple triples extracted from the text and ranks them.
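My oversimplified picture can be written down as a toy. Note that this is plain term overlap, swapped in for illustration, and not Hugo's colored indexes or containment graphs: each extracted triple keeps the sentence it came from, a query scores each sentence by how many query terms it shares, and the scores are summed per subject entity. The data and the function name `rank_entities` are made up.

```python
from collections import Counter


def rank_entities(query, extracted):
    """Rank subject entities by summed query-term overlap of their source sentences.

    extracted: list of (sentence, (subject, predicate, object)) pairs.
    """
    terms = set(query.lower().split())
    scores = Counter()
    for sentence, (subj, _pred, _obj) in extracted:
        overlap = len(terms & set(sentence.lower().split()))
        if overlap:
            scores[subj] += overlap
    # most_common() sorts by descending score.
    return [entity for entity, _score in scores.most_common()]


# Hypothetical triples with the sentences they were extracted from.
EXTRACTED = [
    ("the hotel in paris has a pool", ("Hilton Paris", "hasFacility", "pool")),
    ("paris hilton attended a party", ("Paris Hilton", "attended", "party")),
    ("the hotel offers breakfast", ("Hilton Paris", "offers", "breakfast")),
]

if __name__ == "__main__":
    print(rank_entities("hotel paris", EXTRACTED))
```

Even this toy shows the intent problem: the query terms alone decide whether the hotel or the person comes out on top.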

A question from the audience relates to context, and whether Yahoo! Research is looking into it. The short answer is: not in Barcelona. He does, however, make one nice point on this: “How much can we use of context before the user goes from being happy to being spooked?”