Using multiple shards causes incorrect aggregation results

Reference: [BUG?] Wrong aggregated values shown in visualization

Currently, the aggregate data returned from ES is unreliable when plotting a few terms on the X axis on an index with multiple shards.

Lee pointed out (and I verified) that this is a manifestation of https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts

How do I force Elasticsearch to return all buckets instead of the top N buckets when querying for something like the average of FieldA sorted from lowest to highest?
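For reference, a sketch of the kind of request involved - the index and field names are placeholders, and the shape is my guess at what Kibana sends for such a chart:

```json
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "x_axis_terms": {
      "terms": {
        "field": "FieldB",
        "size": 10,
        "order": { "avg_a": "asc" }
      },
      "aggs": {
        "avg_a": { "avg": { "field": "FieldA" } }
      }
    }
  }
}
```

Each shard returns only its own top 10 terms by that metric, and the coordinating node merges those per-shard lists, so terms that belong in the true global top 10 can be missing entirely.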

Since I am using Kibana, I'd like the setting - if it exists - to be something intrinsic to the index.
If it doesn't exist, may I create a GitHub issue?

Currently, the aggregate data returned from ES is unreliable when plotting a few terms on the X axis on an index with multiple shards.

Bump.

How do I force Elasticsearch to return all buckets instead of the top N buckets when querying for something like the average of FieldA sorted from lowest to highest?

I do not think you can, as this would have the potential to easily overwhelm clusters. If you make sure that you do not have unnecessarily small shards, however, you will reduce the margin of error.
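What you can do is raise the `size` and `shard_size` parameters on the terms aggregation, so that each shard contributes more candidate terms before the results are merged. A sketch - the values are arbitrary examples, not recommendations:

```json
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "x_axis_terms": {
      "terms": {
        "field": "FieldB",
        "size": 100,
        "shard_size": 500,
        "order": { "avg_a": "asc" }
      },
      "aggs": {
        "avg_a": { "avg": { "field": "FieldA" } }
      }
    }
  }
}
```

Larger values reduce the error at the cost of memory and network overhead on every request, which is why an unbounded "return all buckets" option is not offered.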

Hi Christian,

I see this issue pop up with as few as 4 shards (4 shards because I have a 4-core machine).
While I appreciate that it has the potential to overwhelm large indices, the current thresholds aren't always working, as seen here. As a user, it is not acceptable when something fails and doesn't tell me anything.

Right now as a user I have:

  • No warning that the returned values can be wrong (see the sketch after this list)
    • ES would know that it found non-identical buckets across the N shards
    • But it does not tell the user that the values may be unreliable
  • No ability to set a larger bucket count
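
The closest thing I have found is the terms aggregation's `show_term_doc_count_error` flag, which - if I read the docs right - reports an error bound for the document counts only; a sketch:

```json
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "x_axis_terms": {
      "terms": {
        "field": "FieldB",
        "size": 10,
        "show_term_doc_count_error": true
      }
    }
  }
}
```

Each bucket in the response then carries a `doc_count_error_upper_bound`, but as far as I can tell there is no equivalent bound for metric sub-aggregations such as the average above, and nothing like it is surfaced in Kibana as far as I can see.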

As a user, silently misleading/bad data is more dangerous than no data. What is the utility/point if I am told convincingly that the minimum of a certain field across N buckets is 0.24 when it could really be 0.24 or 1000 (basically just about any number)? What did I learn from this DB query that I couldn't have just guessed? Nothing. But in the worst case, I might now make decisions based on the observed value of 0.24, which is wrong - and I would have no idea, until it's probably too late, that I am basing my decisions on bad data. To me it's an annoyance (my data is not too large), but for someone else it could be catastrophically expensive.

What about:

A means to make this bucket count settable by the user at per-index granularity (with the default being whatever it is now)?

And a warning when ES detects this case?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.