Accuracy issue of aggregation results

It seems to be a common problem that the top N results returned from an
aggregation query are inaccurate when matching documents are unevenly
distributed across shards, because ES collects the top N buckets from each
shard regardless of how many hits that shard actually has. Very often we
collect buckets on some shards that should not have been collected, and
miss buckets on other shards that should have been.
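
For readers following along, here is a toy model of the effect described
above. It is only a sketch: the per-shard counts are made up, and the merge
step is a simplified stand-in for what a coordinating node does, not ES
code. Each shard reports only its own top entries, so a term that is
moderately frequent everywhere ("c" here) can be dropped on every shard even
though it is the largest bucket overall.

```python
from collections import Counter


def shard_top_n(shard_counts, n):
    """Return the n largest (term, count) pairs one shard would report."""
    # Sort by count descending, then by term so ties break deterministically.
    return sorted(shard_counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def merged_top_n(shards, n, shard_size=None):
    """Merge the per-shard top lists and keep the overall top n."""
    per_shard = shard_size if shard_size is not None else n
    merged = Counter()
    for counts in shards:
        for term, count in shard_top_n(counts, per_shard):
            merged[term] += count
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


def exact_top_n(shards, n):
    """Top n computed from complete counts, for comparison."""
    total = Counter()
    for counts in shards:
        total.update(counts)
    return sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Hypothetical term counts on three shards.
shards = [
    {"a": 10, "b": 9, "c": 8},
    {"d": 10, "e": 9, "c": 8},
    {"f": 10, "g": 9, "c": 8},
]

approx = merged_top_n(shards, 2)               # each shard reports its top 2
exact = exact_top_n(shards, 2)                 # ground truth
wider = merged_top_n(shards, 2, shard_size=3)  # each shard reports its top 3
print(approx, exact, wider)
```

In this toy data the approximate result misses "c" entirely, while asking
each shard for one more bucket recovers the exact answer.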

Is there a way to collect buckets based on a dynamic "weight", for example
the "total hits" on each shard?

Thanks in advance.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e78571f9-d3e3-4d7c-a60e-d1a2052db397%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Hi Yifan,

Nothing dynamic, but you can increase the number of terms collected on each
shard to increase the accuracy [1]. Might also want to play with the
shard_min_doc_count value if you know certain shards have a low hit count
and are throwing off the aggregations [2].

[1]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_shard_size
[2]
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_minimum_document_count
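
As an illustration of the two settings referenced in [1] and [2], the sketch
below builds a terms aggregation request body as a plain Python dict. The
field name "tag", the aggregation name "top_terms", and the numbers are
placeholders chosen for the example, not values from this thread; only
size, shard_size and shard_min_doc_count are the options being discussed.

```python
import json


def terms_agg_body(field, size, shard_size, shard_min_doc_count=None):
    """Build a search body with one terms aggregation (no hits returned)."""
    terms = {"field": field, "size": size, "shard_size": shard_size}
    if shard_min_doc_count is not None:
        # Only sent when the caller asks for a per-shard frequency floor.
        terms["shard_min_doc_count"] = shard_min_doc_count
    return {"size": 0, "aggs": {"top_terms": {"terms": terms}}}


# Placeholder values: return 10 buckets, but collect 500 per shard.
body = terms_agg_body("tag", size=10, shard_size=500, shard_min_doc_count=2)
print(json.dumps(body, indent=2))
```

The resulting JSON can be sent as the body of a search request with
whatever client is in use.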

Thanks,
Matt Weber

(In reply to Yifan Wang's message of Tue, Sep 16, 2014 at 12:36 PM.)

Hi Matt,

Thanks for your quick response. However, neither option worked for us. In
our case, we set shard_size to 50K (option [1]) and the results are still
missing documents. The cluster becomes unstable if we try to increase it
further. We cannot use shard_min_doc_count (option [2]) either, because
even when a bucket has only one hit, the value used for bucket ordering can
still be large enough that the bucket should be collected. What we really
need is a "weighted" collect. As a workaround we have to make multiple
round trips. A "weighted collect" may carry some performance penalty, but it
would be a better option than multiple trips or a very large shard_size. I
am wondering whether an ES plugin could achieve this.
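
To make the shard_min_doc_count objection concrete, here is a small sketch
under stated assumptions: buckets are ordered by a summed metric rather than
by doc count, the data is invented, and the filter-then-rank logic is a
simplified model of per-shard collection, not ES internals. A bucket with a
single hit but a large metric value ("x") is the true top bucket, and a
per-shard doc-count floor removes it.

```python
def shard_top_by_metric(buckets, shard_size, shard_min_doc_count=1):
    """Rank one shard's buckets by metric after applying the doc-count floor.

    buckets maps term -> (doc_count, metric_sum) for a single shard.
    """
    eligible = {t: b for t, b in buckets.items() if b[0] >= shard_min_doc_count}
    ranked = sorted(eligible.items(), key=lambda kv: (-kv[1][1], kv[0]))
    return ranked[:shard_size]


def merge(shards, size, shard_size, shard_min_doc_count=1):
    """Combine per-shard results and return the top terms by summed metric."""
    totals = {}
    for buckets in shards:
        top = shard_top_by_metric(buckets, shard_size, shard_min_doc_count)
        for term, (docs, metric) in top:
            prev_docs, prev_metric = totals.get(term, (0, 0))
            totals[term] = (prev_docs + docs, prev_metric + metric)
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1][1], kv[0]))
    return [term for term, _ in ranked[:size]]


# Hypothetical data: "x" has one document but by far the largest metric.
shards = [
    {"x": (1, 1000), "y": (5, 50)},
    {"y": (4, 40), "z": (3, 30)},
]

without_floor = merge(shards, size=1, shard_size=2)
with_floor = merge(shards, size=1, shard_size=2, shard_min_doc_count=2)
print(without_floor, with_floor)
```

With the floor set to 2, "x" never leaves its shard and "y" is reported as
the top bucket instead.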

Thanks.

(In reply to Matt Weber's message of Tuesday, September 16, 2014,
4:20:55 PM UTC-4.)

Just in case anyone is interested: a "weighted collect" (collecting more
buckets on shards that hold more documents) does not necessarily improve
accuracy if the documents are distributed by the default hash algorithm.
There is no such correlation.
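
One possible reading of this, sketched below with assumptions clearly
marked: when documents are routed by a hash of their id, every shard ends up
with roughly the same number of documents, so a quota proportional to shard
size is almost the same as a fixed per-shard quota. The md5-based routing
function is a stand-in for the real routing hash, and the ids, shard count
and bucket budget are invented for the example.

```python
import hashlib
from collections import Counter


def route(doc_id, num_shards):
    """Pick a shard from a hash of the document id (md5 is a stand-in)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_doc_counts(doc_ids, num_shards):
    """Count how many documents land on each shard."""
    counts = Counter(route(d, num_shards) for d in doc_ids)
    return [counts[s] for s in range(num_shards)]


def weighted_quotas(counts, total_budget):
    """Split a bucket budget across shards in proportion to doc counts."""
    total = sum(counts)
    return [round(total_budget * c / total) for c in counts]


# Hypothetical setup: 10,000 documents, 5 shards, 500 buckets to hand out.
doc_ids = [f"doc-{i}" for i in range(10000)]
counts = shard_doc_counts(doc_ids, 5)
quotas = weighted_quotas(counts, total_budget=500)
print(counts, quotas)
```

Because the per-shard counts come out close to 2,000 each, the "weighted"
quotas stay close to a flat 100 per shard, which gives the weighting little
to work with.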

(In reply to Yifan Wang's message of Tuesday, September 16, 2014,
5:09:51 PM UTC-4.)