A Search Engine Architecture Based on Collection Selection

Google Tech Talks
December 19, 2007

ABSTRACT

We present a distributed architecture for a Web search engine, based
on the concept of collection selection. We introduce a novel approach
to partitioning the collection of documents that greatly improves the
effectiveness of standard collection selection techniques (CORI), and
a new selection function that outperforms the state of the art. Our
technique is based on the novel query-vector (QV) document model,
built from the analysis of query logs, and on our strategy of
co-clustering queries and documents at the same time.
By suitably partitioning the documents in the collection, our system
is able to select the subset of servers containing the most relevant
documents for each query. Instead of broadcasting the query to every
server in the computing platform, only the most relevant servers are
polled, which reduces the average computing cost of answering a query.
We introduce a novel strategy that uses the instantaneous load at each
server to drive query routing. We also describe a new approach to
caching that incrementally improves the quality of the stored
results. Our caching strategy is effective both in reducing
computing load and in improving result quality. Overall, the proposed
architecture presents a trade-off between computing cost and
result quality, and we show how to guarantee very precise results in
the face of a dramatic reduction in computing load. This means that, with
the same computing infrastructure, our system can serve more users,
more queries and more documents.
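
The selective polling and load-driven routing described in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the system presented in the talk: the `Server` class, its `score` and `load` fields, the `select_servers` function, and the `k` and `max_load` parameters are all hypothetical names, and the rule of skipping servers at or above a load threshold is only a stand-in for whatever routing policy the speaker actually uses. The per-server relevance scores are taken as given; in practice they would come from a collection selection function such as CORI.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    score: float  # hypothetical relevance estimate for the current query
    load: int     # hypothetical count of queries currently in flight


def select_servers(servers, k, max_load):
    """Return the names of up to k servers to poll, most relevant first.

    Servers whose load is at or above max_load are skipped. This is an
    illustrative rule, not the routing policy from the talk.
    """
    ranked = sorted(servers, key=lambda s: s.score, reverse=True)
    available = [s for s in ranked if s.load < max_load]
    return [s.name for s in available[:k]]


servers = [
    Server("s1", score=0.9, load=3),
    Server("s2", score=0.7, load=10),
    Server("s3", score=0.4, load=1),
    Server("s4", score=0.1, load=0),
]

# Poll 2 of the 4 servers instead of broadcasting to all of them.
print(select_servers(servers, k=2, max_load=8))  # ['s1', 's3']
```

In this toy run the second most relevant server is skipped because it is overloaded, so the query goes to `s1` and `s3`. Raising `k` or `max_load` moves the sketch back toward a full broadcast, which is one way to picture the trade-off between computing cost and result quality that the abstract mentions.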

Speaker: Diego Puppin

Duration : 0:33:1


3 Responses to “A Search Engine Architecture Based on Collection Selection”

  • AskASearchEngineGuru says:

    Good idea, however I find only one other approach to make it speedier.

  • vicaya says:

    Sorry, this strategy doesn’t work well with long-tail and personalized search loads. The indexing cost (I’d consider cluster selection an indexing phase) is much higher as well. For aggregate performance, a much simpler caching strategy (multiple (for different types/languages etc.) doc.part + (pre-computed/trained) distributed query cache) can be built that matches or outperforms this complicated solution.

  • wildchildplasma says:

    The crusing capabilities of active data clouds you mean?
    One day it’ll know the kind of stuff I want and I won’t even have to make entries all the time. (Standard unified ratings data).
    I’ll also be able to talk to a bot which will adapt its data personality so as to know me better.
