For the record, the bottleneck would not be on the master node (the node
that manages the cluster state) but on the node that coordinates the
execution of the search request, which is the node that your client
contacts. So if you are doing costly terms aggregations with large shard
sizes, it would help to round-robin your requests between several nodes.
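To illustrate the round-robin idea, here is a minimal client-side sketch. The node addresses and the "posts" index name are hypothetical placeholders, and it only builds the search URLs rather than sending requests; many client libraries can do this rotation for you if you give them several hosts.

```python
from itertools import cycle

# Hypothetical node addresses; replace with the HTTP endpoints of your own nodes.
NODES = [
    "http://es-node1:9200",
    "http://es-node2:9200",
    "http://es-node3:9200",
]


class RoundRobinNodes:
    """Hands out node URLs in rotation so coordination work is spread out."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self._cycle = cycle(list(nodes))

    def next_search_url(self, index):
        # Each call picks the next node; that node coordinates the search
        # and performs the final reduce of the per-shard buckets.
        return f"{next(self._cycle)}/{index}/_search"


picker = RoundRobinNodes(NODES)
# "posts" is a hypothetical index name used only for illustration.
urls = [picker.next_search_url("posts") for _ in range(4)]
for url in urls:
    print(url)
```

The fourth request wraps around to the first node again, so no single node ends up reducing every expensive aggregation.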
If you are interested in the accuracy issues of the terms aggregation, I
would recommend reading
and upgrading to Elasticsearch 1.4, which now returns an error bound on the
counts, so that you know how far off the counts might be. The only way to
improve accuracy is to increase the shard size, but as you noted, this
raises issues too.
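As a rough sketch of how this could look from the client side: the request body below sets shard_size on a terms aggregation, and the helper reads the doc_count_error_upper_bound values that 1.4 returns. The field name "author", the aggregation name "top_authors", and the sample response are assumptions made up for illustration, not output from a real cluster.

```python
def top_terms_query(field, size, shard_size):
    """Build a search body asking for the top `size` terms of `field`.

    A larger shard_size makes each shard return more candidate buckets,
    which improves accuracy but gives the coordinating node more to reduce.
    """
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            "top_authors": {
                "terms": {
                    "field": field,
                    "size": size,
                    "shard_size": shard_size,
                    # Ask for a per-bucket error bound as well (1.4+).
                    "show_term_doc_count_error": True,
                }
            }
        },
    }


def error_bounds(response, agg_name="top_authors"):
    """Return the overall error bound and (key, count, bound) per bucket."""
    agg = response["aggregations"][agg_name]
    per_bucket = [
        (b["key"], b["doc_count"], b.get("doc_count_error_upper_bound"))
        for b in agg["buckets"]
    ]
    return agg.get("doc_count_error_upper_bound"), per_bucket


body = top_terms_query("author", size=10, shard_size=200)

# Made-up response, shaped like a 1.4 terms aggregation result,
# used only to exercise the helper above.
sample_response = {
    "aggregations": {
        "top_authors": {
            "doc_count_error_upper_bound": 12,
            "buckets": [
                {"key": "alice", "doc_count": 340, "doc_count_error_upper_bound": 0},
                {"key": "bob", "doc_count": 295, "doc_count_error_upper_bound": 12},
            ],
        }
    }
}

overall, buckets = error_bounds(sample_response)
```

If the overall bound is larger than the gap between the counts you care about, that is a sign the shard size is still too small for the ranking to be trusted.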
On Thu, Dec 18, 2014 at 8:27 AM, yang ming ymbloy@gmail.com wrote:
Hi All,
We use the terms aggregation to get the top n authors, but the
aggregation may not return the true top n authors.
As the Elasticsearch guide says, the aggregated results are not always
accurate.
We can indeed increase the shard size to get more accurate results,
but if each shard returns enough buckets, there will be a
bottleneck on the master node when it reduces the final result.
Is there another way to improve the accuracy of the terms aggregation?
Is there a good way to decrease the pressure on the master node when