Sorry, if I did not make it clear. For sure I know aggregation is done on
the node for each shard, but here is the challenge. Say we set
shard_size=50,000. ES will aggregate on each shard and create buckets for
the matching documents, and then send top 50,000 buckets to the client node
for Reduce. Say we have 50 data nodes, and each node has 32 shards. This
means we need to send 50,000 buckets from each shard to the client node for
final aggregation. First, this may add heavy traffic to the network (what
if we have 100 nodes?). And second, the client will need to aggregate on
received 503250,000 buckets. Would this cause any congestion on the
client node? However if we can aggregate on the node first, meaning reduce
from 32 buckets to only one bucket, then the client node only has to
process 50 buckets. This would significanly reduce the network traffic and
improve the scalability, plus because we can set relatively larger
shard_size, it will improve the accuracy of the final results, which is
another key issue we are facing in distributed environment on aggregations.
So my key question is about the scalability particularly on aggregations.
It seems to be a challenge in my experience. I just want to hear other
people's experience. On heavy analytics applications, this will be a key.
Of course, I also understand, adding node level aggregation may impact the
overall performance. I am wondering if anyone has thought about or done
anything in this aspect.
BTW, I like Elasticsearch, but want to hear from the community on some of
the key challenges.
On Thursday, December 18, 2014 9:34:07 AM UTC-5, Adrien Grand wrote:
+1 to what AlexR said. I think there is indeed a bad assumption that
shards just forward data to the coordinating node, this is not the case.
On Thu, Dec 18, 2014 at 1:09 AM, AlexR <royt...@gmail.com <javascript:>>
wrote:
if you take a terms aggregation, the heavy lifting of the aggregation is
done on each node then aggregated results are combined on the master node.
So if you have thousands of nodes and very high cardinality nested aggs the
merging part may become a bottleneck but cost of doing actual aggregation
in most cases is far higher than cost of merging results from reasonable
number of shards. So in practice I think it balances pretty well. Of course
you are not limited to one master to handle concurrent requests
On Wednesday, December 17, 2014 4:12:44 PM UTC-5, Yifan Wang wrote:
I thought ES only "Collect" on individual shards, and "Reduce" on Client
Node (master if you call it), nothing is done at the data node level.
On Tuesday, December 16, 2014 1:31:30 PM UTC-5, AlexR wrote:
ES already doing aggregations on each node. it is not like it is
shipping row level query data back to master for aggregation.
In fact, one unpleasant effect of it is that aggregation results are
not guaranteed to be precise due to distributed nature of the aggregation
for multibucket aggs ordered by count such as terms
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com <javascript:>.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/61122d28-8f62-4ee2-b9e7-6fd99048ee8e%40googlegroups.com
https://groups.google.com/d/msgid/elasticsearch/61122d28-8f62-4ee2-b9e7-6fd99048ee8e%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.
--
Adrien Grand
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3519d982-390a-4a62-9d75-0bb5e1a4654b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.