A Search Engine Architecture Based on Collection Selection
Google Tech Talks
December 19, 2007
ABSTRACT
We present a distributed architecture for a Web search engine, based
on the concept of collection selection. We introduce a novel approach
to partition the collection of documents, able to greatly improve the
effectiveness of standard collection selection techniques (CORI), and
a new selection function outperforming the state of the art. Our
technique is based on the novel query-vector (QV) document model,
built from the analysis of query logs, and on our strategy of
co-clustering queries and documents at the same time.
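The abstract does not spell out the QV model. One plausible reading, assumed here, is that each document is described by the logged queries that retrieve it rather than by its own text. The sketch below builds such vectors from a toy query log; the function name, the log format, and the reciprocal-rank weighting are illustrative assumptions, not the speaker's definitions.

```python
from collections import defaultdict

def build_query_vectors(query_log_results):
    """Map each document to the logged queries that returned it.

    query_log_results: dict of query string -> ranked list of doc ids
    (a hypothetical format for results recorded against a training log).
    """
    vectors = defaultdict(dict)
    for query, ranked_docs in query_log_results.items():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            # Assumed weighting: higher-ranked results count more.
            vectors[doc_id][query] = 1.0 / rank
    return dict(vectors)

log = {
    "cheap flights": ["d1", "d2"],
    "flight deals": ["d2", "d1"],
    "python tutorial": ["d3"],
}
qv = build_query_vectors(log)
```

Documents answering similar queries (here `d1` and `d2`) end up with overlapping vectors, which is the kind of signal a co-clustering step could use to place them on the same server.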
By suitably partitioning the documents in the collection, our system
is able to select the subset of servers containing the most relevant
documents for each query. Instead of broadcasting the query to every
server in the computing platform, only the most relevant servers are
polled, which reduces the average computing cost of answering a query.
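As a rough illustration of polling only a subset of servers, the sketch below ranks servers against a query and keeps the top k. The scoring rule (a sum of log-damped term counts) is a deliberately simple stand-in, not CORI and not the selection function presented in the talk; `select_servers` and the per-server summaries are hypothetical.

```python
import math

def select_servers(query_terms, server_summaries, k):
    """Return the k servers whose summaries best match the query.

    server_summaries: dict of server id -> {term: document count},
    an assumed per-server summary of the documents it holds.
    """
    scores = {}
    for server, term_counts in server_summaries.items():
        scores[server] = sum(
            math.log1p(term_counts.get(term, 0)) for term in query_terms
        )
    # Highest score first; ties broken by server id for determinism.
    ranked = sorted(scores, key=lambda s: (-scores[s], s))
    return ranked[:k]

summaries = {
    "s1": {"flight": 120, "hotel": 80},
    "s2": {"python": 300, "tutorial": 150},
    "s3": {"flight": 5, "python": 10},
}
selected = select_servers(["python", "tutorial"], summaries, k=2)
```

With k smaller than the number of servers, the query is never sent to `s1`, which holds nothing relevant to it.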
We introduce a novel strategy to use the instant load at each server
to drive the query routing. Also, we describe a new approach to
caching, able to incrementally improve the quality of the stored
results. Our caching strategy is effective both in reducing
computing load and in improving result quality. The proposed
architecture, overall, presents a trade-off between computing cost and
result quality, and we show how to guarantee very precise results in
the face of a dramatic reduction in computing load. This means that, with
the same computing infrastructure, our system can serve more users,
more queries and more documents.
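The abstract gives no mechanism for load-driven routing or for the incremental cache. One way the two ideas could fit together, offered purely as an assumption, is for a cache entry to remember which servers have already answered a query, so that each repeat of the query polls further servers that are not currently overloaded and merges their results in. All names and the load threshold below are hypothetical.

```python
def answer(query, ranked_servers, load, max_load, per_query, cache, search):
    """Answer a query, refining the cached result on every repeat.

    ranked_servers: servers ordered by estimated relevance to the query.
    load: dict of server id -> current load in [0, 1] (assumed metric).
    search: callable (server, query) -> {doc_id: score}.
    """
    entry = cache.setdefault(query, {"polled": set(), "results": {}})
    # Skip servers that already contributed or are too busy right now.
    candidates = [
        s for s in ranked_servers
        if s not in entry["polled"] and load[s] < max_load
    ]
    for server in candidates[:per_query]:
        entry["results"].update(search(server, query))
        entry["polled"].add(server)
    results = entry["results"]
    return sorted(results, key=lambda d: (-results[d], d))

index = {"s1": {"d1": 0.9}, "s2": {"d2": 0.7}, "s3": {"d3": 0.8}}

def fake_search(server, query):
    return index[server]

cache = {}
load = {"s1": 0.2, "s2": 0.95, "s3": 0.1}
servers = ["s1", "s2", "s3"]
first = answer("q", servers, load, 0.9, 1, cache, fake_search)
second = answer("q", servers, load, 0.9, 1, cache, fake_search)
```

In this toy run the first call polls only `s1`, and the repeat adds results from `s3` while the overloaded `s2` is skipped, so the cached answer improves without any single request paying the full broadcast cost.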
Speaker: Diego Puppin
Duration: 0:33:1

Good idea; however, I find only one other approach to make it speedier.
Sorry, this strategy doesn’t work well with long-tail and personalized search loads. The indexing cost (I’d consider cluster selection an indexing phase) is much higher as well. For aggregate performance, a much simpler caching strategy (multiple document partitions, e.g. for different types or languages, plus a pre-computed or trained distributed query cache) can be built that matches or outperforms this complicated solution.
The cruising capabilities of active data clouds, you mean?
One day it’ll know the kind of stuff I want and I won’t even have to make entries all the time. (Standard unified ratings data).
I’ll also be able to talk to a bot which will adapt its data personality so as to know me better.