It seems to be a common problem that the top N results returned from an
aggregation query is inaccurate due to uneven distribution of matching
documents on different shards, because ES will collect top N buckets from
each shard no matter actually how many hits are on each shard. It is very
often we collect buckets that should have not been collected on some
shards, but we missed buckets that should have collected on some others.
Is there a way we can collect buckets based on a dynamic "weight", for
example "total hits", on that shard?
Nothing dynamic, but you can increase the number of terms collected on each
shard to increase the accuracy [1]. Might also want to play with the
shard_min_doc_count value if you know certain shards have a low hit count
and are throwing off the aggregations [2].
It seems to be a common problem that the top N results returned from an
aggregation query is inaccurate due to uneven distribution of matching
documents on different shards, because ES will collect top N buckets from
each shard no matter actually how many hits are on each shard. It is very
often we collect buckets that should have not been collected on some
shards, but we missed buckets that should have collected on some others.
Is there a way we can collect buckets based on a dynamic "weight", for
example "total hits", on that shard?
Thanks for your quick response. However neither worked for us. In our case,
we set shard_size to 50K (option1 ), it is still missing documents. The
cluster became unstable if we try to further increase it. We cannot use
shard_min_doc_count_value, because even it is one hit, its value used for
bucket ordering can still be large enough to be collected. What we really
need is "weighted" collect. As a workaround we have to do multiple trips.
"Weighted collect" may have some performance penalty, but it would be
better option than multiple trips or setting large shard_size. I am
wondering if ES plugin can achieve this goal.
Thanks.
On Tuesday, September 16, 2014 4:20:55 PM UTC-4, Matt Weber wrote:
Hi Yifan,
Nothing dynamic, but you can increase the number of terms collected on
each shard to increase the accuracy [1]. Might also want to play with the
shard_min_doc_count value if you know certain shards have a low hit count
and are throwing off the aggregations [2].
On Tue, Sep 16, 2014 at 12:36 PM, Yifan Wang <yifan.w...@gmail.com
<javascript:>> wrote:
It seems to be a common problem that the top N results returned from an
aggregation query is inaccurate due to uneven distribution of matching
documents on different shards, because ES will collect top N buckets from
each shard no matter actually how many hits are on each shard. It is very
often we collect buckets that should have not been collected on some
shards, but we missed buckets that should have collected on some others.
Is there a way we can collect buckets based on a dynamic "weight", for
example "total hits", on that shard?
Just in case anyone is interested, "weighted collect" (collect more on
shards of more documents) actually does not necessarily improve the
accuracy if the documents are distributed by default hash algorithm. There
is no such correlations.
On Tuesday, September 16, 2014 5:09:51 PM UTC-4, Yifan Wang wrote:
Hi Matt,
Thanks for your quick response. However neither worked for us. In our
case, we set shard_size to 50K (option1 ), it is still missing documents.
The cluster became unstable if we try to further increase it. We cannot use
shard_min_doc_count_value, because even it is one hit, its value used for
bucket ordering can still be large enough to be collected. What we really
need is "weighted" collect. As a workaround we have to do multiple trips.
"Weighted collect" may have some performance penalty, but it would be
better option than multiple trips or setting large shard_size. I am
wondering if ES plugin can achieve this goal.
Thanks.
On Tuesday, September 16, 2014 4:20:55 PM UTC-4, Matt Weber wrote:
Hi Yifan,
Nothing dynamic, but you can increase the number of terms collected on
each shard to increase the accuracy [1]. Might also want to play with the
shard_min_doc_count value if you know certain shards have a low hit count
and are throwing off the aggregations [2].
It seems to be a common problem that the top N results returned from an
aggregation query is inaccurate due to uneven distribution of matching
documents on different shards, because ES will collect top N buckets from
each shard no matter actually how many hits are on each shard. It is very
often we collect buckets that should have not been collected on some
shards, but we missed buckets that should have collected on some others.
Is there a way we can collect buckets based on a dynamic "weight", for
example "total hits", on that shard?
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.