Using multiple shards causes incorrect aggregation results

Reference: [BUG?] Wrong aggregated values shown in visualization

Currently, the aggregate data returned from ES is unreliable when plotting a few terms on the X axis on an index with multiple shards.

Lee pointed out (and I verified) that this is a manifestation of https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts

How do I force Elasticsearch to return all buckets instead of the top N buckets when querying for something like the average of FieldA sorted from lowest to highest?
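For reference, a sketch of the kind of request involved - the index and field names are placeholders, and the shape is my guess at what Kibana sends for such a chart:

```json
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "x_axis_terms": {
      "terms": {
        "field": "FieldB",
        "size": 10,
        "order": { "avg_a": "asc" }
      },
      "aggs": {
        "avg_a": { "avg": { "field": "FieldA" } }
      }
    }
  }
}
```

Each shard returns only its own top 10 terms by that metric, and the coordinating node merges those per-shard lists, so terms that belong in the true global top 10 can be missing entirely.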

Since I am using Kibana, I'd like the setting - if it exists - to be something intrinsic to the index.
If it doesn't exist, may I create a GitHub issue?

Currently, the aggregate data returned from ES is unreliable when plotting a few terms on the X axis on an index with multiple shards.

Bump.

How do I force Elasticsearch to return all buckets instead of the top N buckets when querying for something like the average of FieldA sorted from lowest to highest?

I do not think you can, as this would have the potential to easily overwhelm clusters. If you make sure that you do not have unnecessarily small shards, however, you will reduce the margin of error.
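What you can do is raise the `size` and `shard_size` parameters on the terms aggregation, so that each shard contributes more candidate terms before the results are merged. A sketch - the values are arbitrary examples, not recommendations:

```json
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "x_axis_terms": {
      "terms": {
        "field": "FieldB",
        "size": 100,
        "shard_size": 500,
        "order": { "avg_a": "asc" }
      },
      "aggs": {
        "avg_a": { "avg": { "field": "FieldA" } }
      }
    }
  }
}
```

Larger values reduce the error at the cost of memory and network overhead on every request, which is why an unbounded "return all buckets" option is not offered.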

Hi Christian,

I see this issue pop up with as few as 4 shards (4 shards because I have a 4-core machine).
While I appreciate that it has the potential to overwhelm large indices, the current thresholds aren't always working, as seen here. As a user, it is not acceptable when something fails and doesn't tell me anything.

Right now as a user I have:

  • No warning that the returned values can be wrong (see the sketch after this list)
    • ES would know that it found non-identical buckets across the N shards
    • But it does not tell the user that the values may be unreliable
  • No ability to set a larger bucket count
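
The closest thing I have found is the terms aggregation's `show_term_doc_count_error` flag, which - if I read the docs right - reports an error bound for the document counts only; a sketch:

```json
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "x_axis_terms": {
      "terms": {
        "field": "FieldB",
        "size": 10,
        "show_term_doc_count_error": true
      }
    }
  }
}
```

Each bucket in the response then carries a `doc_count_error_upper_bound`, but as far as I can tell there is no equivalent bound for metric sub-aggregations such as the average above, and nothing like it is surfaced in Kibana as far as I can see.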

As a user, silently misleading/bad data is more dangerous than no data. What is the utility/point if I am told convincingly that the minimum of a certain field across N buckets is 0.24 when it could really be 0.24 or 1000 (basically just about any number)? What did I learn from this DB query that I couldn't have just guessed? Nothing. But in the worst case, I might now make decisions based on the observed value of 0.24, which is wrong - and I would have no idea, until it's probably too late, that I am basing my decisions on bad data. To me it's an annoyance (my data is not too large), but for someone else it could be catastrophically expensive.

What about:

A means to make this bucket count settable by the user at per-index granularity (with the default being whatever it is now)?

And a warning when ES detects this case?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.