I read the documentation about the distributed search execution of Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/guide/current/distributed-search.html
... and I was wondering whether the query and fetch phases follow the principles of map/reduce in its abstract sense?
For example, the query phase:
map phase: the search request is executed on one copy (primary or replica) of every shard in the index. Each shard's temporary result is a local sorted priority queue.
reduce phase: every shard returns its docIDs to the coordinating node, which merges the results into one final result.
So is it fair to say that Elasticsearch uses map/reduce algorithms internally? And if so, are there other functionalities, like reindexing, which use map/reduce?
Yep, you are absolutely correct. Much of the search and aggregation functionality is essentially map-reduce.
Aggregations are basically 100% map-reduce. Each agg has a collect() method which is essentially a map function. For an average aggregation, for instance, it's collecting the values needed to calculate the average.
Once all the documents have been collected (mapped), each shard sends its partial result to the coordinating node, which reduces them to calculate the global average (or whatever metric).
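To make the collect/reduce split concrete, here's a minimal sketch of a distributed average in plain Python. It is not Elasticsearch code: `PartialAvg`, `collect` and `reduce_avg` are names made up for illustration, and the "shards" are just lists of numbers.

```python
from dataclasses import dataclass


@dataclass
class PartialAvg:
    """Hypothetical per-shard partial result: enough state to merge later."""
    total: float = 0.0
    count: int = 0


def collect(shard_values):
    """Map step: runs independently on each shard's documents."""
    partial = PartialAvg()
    for value in shard_values:
        partial.total += value
        partial.count += 1
    return partial


def reduce_avg(partials):
    """Reduce step: runs on the coordinating node over all shard partials."""
    total = sum(p.total for p in partials)
    count = sum(p.count for p in partials)
    return total / count if count else None


# Three pretend shards holding the field values of their local documents.
shards = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
partials = [collect(values) for values in shards]
global_avg = reduce_avg(partials)
```

Note that each shard ships a sum and a count rather than its own average. Averaging the per-shard averages would weight small shards too heavily and give the wrong global answer.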
EDIT: Also, pipeline aggs are similar to an extra reduce phase (sorta), because they process the results of agg buckets. So with pipelines you end up with map-reduce-reduce.
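A rough sketch of that map-reduce-reduce shape, again in plain Python rather than real Elasticsearch internals. The function names and the monthly-sum data are invented for illustration; the second reduce mimics what a derivative-style pipeline would do over already-reduced buckets.

```python
from collections import Counter


def collect(shard_docs):
    """Map step: per-shard sum of amounts, bucketed by month."""
    sums = Counter()
    for month, amount in shard_docs:
        sums[month] += amount
    return sums


def reduce_buckets(partials):
    """First reduce: merge the per-shard buckets on the coordinating node."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values key by key
    return dict(sorted(total.items()))


def derivative(buckets):
    """Second reduce (pipeline-like): works only on the final buckets."""
    keys = list(buckets)
    return {cur: buckets[cur] - buckets[prev] for prev, cur in zip(keys, keys[1:])}


shards = [
    [("2024-01", 10), ("2024-02", 5)],
    [("2024-01", 20), ("2024-02", 40), ("2024-03", 15)],
]
buckets = reduce_buckets(collect(docs) for docs in shards)
changes = derivative(buckets)
```

The point is that `derivative` never sees individual documents or shards, only the output of the first reduce.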
Search is slightly different, since there's an extra phase to go fetch the actual document. But it's basically map-reduce too: score the documents with the query (map), send the top-n results to the coordinating node which finds global top-n (reduce), then fetch the original document source.
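Here's a toy version of those three steps, assuming a made-up scoring function (term count) and dictionaries standing in for shards. None of these names come from Elasticsearch; it's only meant to show where the map, the reduce and the fetch happen.

```python
import heapq


def query_shard(shard_idx, shard_docs, score, n):
    """Map step: score local docs and keep only the local top n hits."""
    hits = ((score(text), shard_idx, doc_id) for doc_id, text in shard_docs.items())
    return heapq.nlargest(n, hits)


def merge_top_n(shard_results, n):
    """Reduce step: the coordinating node picks the global top n."""
    return heapq.nlargest(n, (hit for hits in shard_results for hit in hits))


def fetch(shards, hits):
    """Fetch step: go back to the owning shard for each winning doc."""
    return [shards[shard_idx][doc_id] for _score, shard_idx, doc_id in hits]


shards = [
    {"a1": "map reduce map", "a2": "reduce"},
    {"b1": "map", "b2": "map map map"},
]


def score(text):
    # Hypothetical relevance: how often the term "map" occurs.
    return text.split().count("map")


n = 2
shard_results = [query_shard(i, docs, score, n) for i, docs in enumerate(shards)]
top_hits = merge_top_n(shard_results, n)
documents = fetch(shards, top_hits)
```

Only scores and IDs travel to the coordinating node during the reduce; the document bodies are fetched afterwards, and only for the hits that made the global top n.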
Reindexing has parallel semantics via slicing, but I wouldn't really classify it as map-reduce because there's not really any reduction... it's just parallelizing indexing. Anything that is built on the search infra will be map-reduce to a degree, such as suggesters. I can't think of any other examples off the top of my head, but there may be more lurking in there. Map-reduce is a pretty common pattern in distributed systems where problems are easily partitioned.
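For contrast, a sketch of what sliced reindexing looks like as a pattern: partition the work, run the partitions in parallel, and that's it. This is an illustration under assumptions, not how Elasticsearch implements it; the CRC32-based `slice_of` and the dict "indices" are stand-ins.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def slice_of(doc_id, num_slices):
    """Assumed partitioning rule: a stable hash of the doc ID picks the slice."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_slices


def reindex_slice(source, dest, slice_id, num_slices):
    """Each worker copies only the docs that belong to its slice."""
    copied = 0
    for doc_id, doc in source.items():
        if slice_of(doc_id, num_slices) == slice_id:
            dest[doc_id] = doc
            copied += 1
    return copied


source = {f"doc-{i}": {"value": i} for i in range(20)}
dest = {}
num_slices = 4

with ThreadPoolExecutor(max_workers=num_slices) as pool:
    counts = list(
        pool.map(lambda s: reindex_slice(source, dest, s, num_slices), range(num_slices))
    )
```

There's no merge step that combines worker outputs into a smaller result. The only thing coming back is bookkeeping (how many docs each slice copied), which is why it reads as plain parallelism rather than map-reduce.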