How does ES merge docLists in a distributed multi-term query?

Yang_Yang · July 17, 2015, 8:06am

let's say I query by "dog AND cat"

if the doclists for "dog" is partitioned , and stored on server S1, S2, S3
and similarly that for "cat" is on S4, S5, S6
(for simplicity we don't consider replication here, so every docList part has just a single copy)

what does S1 ... S6 return ? the full doclist part each server has ? ---- that would be very long , because there would be billions of websites in the world talking about dogs. in what order are the docLists parts stored on each server? by docID ? or by score (like pageRank etc) ? then when each server returns the partial docList, is it in docID order ? or sorted by score? there has to be some sort of trimming before the complete docList part is returned from each server, how is this trimming done? by scoring? but what if one server stores consistently bad pages and another one has good pages? in that case if everyone is given a quota of top-100 entries, then the bad server unnecessarily introduce too many garbage results into the combined result.

that's a lot. I read the 1997 "Anatomy of a search engine" paper but it seems it was quite hand-wavy on many of the finer points. plus technology evolved a lot in the past 20 years, so I'd really appreciate if someone could also point me to some architecture/design docs of ES.

thanks
yang

colings86 · July 17, 2015, 8:10am

Have a look at this section of the Elasticsearch: The Definitive Guide book: https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-search.html

It describes how distributed search is performed. Other sections around this one many be worth looking at too for what you are after.

Hope that helps