Accuracy issue of aggregation results

Yifan_Wang · September 16, 2014, 7:36pm

It seems to be a common problem that the top N results returned from an
aggregation query are inaccurate when matching documents are unevenly
distributed across shards, because ES collects the top N buckets from
each shard no matter how many hits that shard actually has. Very often we
collect buckets on some shards that should not have been collected, while
missing buckets on other shards that should have been.
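
To make the effect concrete, here is a minimal sketch in plain Python (not Elasticsearch code). The shard contents, term names, and the helper functions `top_n_via_shards` and `exact_top_n` are made up for illustration. It assumes each shard returns only its local top `shard_size` buckets and the coordinating node merges them.

```python
from collections import Counter

def top_n_via_shards(shards, n, shard_size):
    """Merge each shard's local top `shard_size` buckets, then take the global top `n`."""
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(n)

def exact_top_n(shards, n):
    """Sum every bucket on every shard, then take the global top `n`."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total.most_common(n)

# Hypothetical per-shard doc counts. "c" is third on both shards
# but first overall (8 + 5 = 13).
shards = [
    {"a": 10, "b": 9, "c": 8},
    {"d": 7, "e": 6, "c": 5},
]

approx = top_n_via_shards(shards, n=2, shard_size=2)  # [("a", 10), ("b", 9)]
exact = exact_top_n(shards, n=2)                      # [("c", 13), ("a", 10)]
```

With `shard_size=2` the true top bucket "c" never leaves either shard, so the merged result misses it entirely.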

Is there a way we can collect buckets based on a dynamic "weight", for
example "total hits", on that shard?

Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e78571f9-d3e3-4d7c-a60e-d1a2052db397%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

mattweber · September 16, 2014, 8:20pm

Hi Yifan,

Nothing dynamic, but you can increase the number of terms collected on each
shard to increase the accuracy [1]. Might also want to play with the
shard_min_doc_count value if you know certain shards have a low hit count
and are throwing off the aggregations [2].

[1]

[2]
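
As a rough sketch of what such a request could look like, the snippet below builds a terms aggregation body with both settings. `shard_size` and `shard_min_doc_count` are the parameters mentioned above; the field name `category`, the aggregation name `top_terms`, the helper `terms_agg_body`, and the numbers are placeholders, not recommendations.

```python
import json

def terms_agg_body(field, size, shard_size, shard_min_doc_count=None):
    """Build a search body with one terms aggregation (names are illustrative)."""
    terms = {"field": field, "size": size, "shard_size": shard_size}
    if shard_min_doc_count is not None:
        # Buckets with fewer docs than this on a shard are not returned by that shard.
        terms["shard_min_doc_count"] = shard_min_doc_count
    return {"size": 0, "aggs": {"top_terms": {"terms": terms}}}

# Ask for the top 10 terms, but let each shard return its top 200 candidates.
body = terms_agg_body("category", size=10, shard_size=200, shard_min_doc_count=2)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a search request. A larger `shard_size` trades memory and network cost for accuracy.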

Thanks,
Matt Weber

Yifan_Wang · September 16, 2014, 9:09pm

Hi Matt,

Thanks for your quick response. However, neither worked for us. In our case,
we set shard_size to 50K (option 1) and it is still missing documents. The
cluster became unstable when we tried to increase it further. We cannot use
shard_min_doc_count either, because even a bucket with a single hit can have
an ordering value large enough for it to be collected. What we really
need is a "weighted" collect. As a workaround, we have to make multiple trips.
A "weighted collect" may carry some performance penalty, but it would be a
better option than multiple trips or a very large shard_size. I am
wondering whether an ES plugin could achieve this.
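
For readers wondering what a "multiple trips" workaround might look like, here is one possible shape, simulated in plain Python. This is an assumption on my part and not necessarily the approach described above: trip 1 gathers candidate terms from each shard's local top list, and trip 2 asks every shard for exact counts of only those candidates. The shard data and the helper `two_trip_top_n` are hypothetical.

```python
from collections import Counter

def two_trip_top_n(shards, n, shard_size):
    """Simulate a two-pass top-N over per-shard term counts."""
    # Trip 1: collect candidate terms from each shard's local top `shard_size`.
    candidates = set()
    for shard in shards:
        candidates.update(term for term, _ in Counter(shard).most_common(shard_size))
    # Trip 2: fetch exact counts for the candidates from every shard.
    totals = Counter()
    for shard in shards:
        for term in candidates:
            totals[term] += shard.get(term, 0)
    return totals.most_common(n)

# Hypothetical per-shard doc counts.
shards = [
    {"a": 10, "b": 9, "c": 8},
    {"c": 9, "d": 8, "b": 2},
]

result = two_trip_top_n(shards, n=2, shard_size=2)  # [("c", 17), ("b", 11)]
```

The second trip corrects the counts of terms that at least one shard nominated, but a term that falls outside every shard's local top list is still missed.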

Thanks.

Yifan_Wang · December 17, 2014, 10:24pm

Just in case anyone is interested: "weighted collect" (collecting more on
shards with more documents) does not necessarily improve accuracy
if the documents are distributed by the default hash algorithm. There
is no such correlation.
