[BUG?] Wrong aggregated values shown in visualization

I seem to have run across what is, for my use case, a deal-breaking bug:
When visualizing the average of a numeric field on the Y-axis against a term field (a name) on the X-axis, the data shown differs substantially depending on whether or not I have filters enabled.

Here are the details:

  • A few thousand (2,000 - 20,000) documents, each with the same set of fields (1,000 - 4,000 fields per document)
  • Plotting a term field (say "name") on the X-axis versus the average of another, numeric field on the Y-axis

Why this is a major bug for me:
If I want to find an outlier (say, the 5 lowest averages) then I cannot do so now, because what I see in the aggregated view often bears no relation to the actual aggregate of the underlying data. The issue is much worse when dealing with percentages.
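For concreteness, the request behind such a visualization would look roughly like the following (the index and field names here are placeholders, not my actual ones):

    POST /my-index/_search
    {
      "size": 0,
      "aggs": {
        "by_name": {
          "terms": {
            "field": "name",
            "size": 5,
            "order": { "avg_value": "asc" }
          },
          "aggs": {
            "avg_value": { "avg": { "field": "value" } }
          }
        }
      }
    }

As far as I understand, the per-term averages returned by a request like this are what end up plotted.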

I have a dataset that can be used to reproduce the bug, and I have made a video showing it in action.

Please let me know if I am doing something incorrectly, or if this issue is supposed to be filed under Elasticsearch.

Some notes:

  1. It is easier to reproduce the bug when you have a large number of documents and fields.

    1. For example, the bug reproduces with 5 X-axis terms when there are ~1000 documents of ~500 fields each
    2. If I drop the document count down to 500, then I often need to drop the X-axis terms to 3 to exhibit the bug
  2. The Elasticsearch response seems to indicate that ALL the documents are hits, but within the average aggregation only a few of them are counted (i.e., the bucket's doc_count does not represent all the available docs for that term)

It might be this issue:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts

I'll look into it more, but that seems like it might be the issue.
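One way to check whether that per-shard term selection is in play (a sketch, reusing the placeholder index and field names from above) would be to ask the terms aggregation to report its error bounds:

    POST /my-index/_search
    {
      "size": 0,
      "aggs": {
        "by_name": {
          "terms": {
            "field": "name",
            "size": 5,
            "show_term_doc_count_error": true
          },
          "aggs": {
            "avg_value": { "avg": { "field": "value" } }
          }
        }
      }
    }

A non-zero doc_count_error_upper_bound on the buckets, or a large sum_other_doc_count, would suggest the per-shard approximation described on that page is what I'm seeing.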

Regards,
Lee


Thanks Lee. My reading of the page you linked to is that this is a performance feature - which is nice and often downright essential.

I'd like to make a case for an option to override this at index-by-index granularity. The thing is, while I generate most of the visualizations, some of the consumers (who are often not savvy) also create their own visualizations - and when they see this unexpected behaviour they are put off and dismiss the tool as unreliable. Having an index-level option would let me ensure that they see what they expect.

EDIT:
Rerunning the dataset with 1 shard seems to make the bug go away (I had been using 4 shards until now) - so it seems likely that what you said is indeed the cause.
But I assume I am leaving a lot of performance on the table, since my system has 4 threads available and a single shard would use only one? Would increasing the number of replicas help?
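(For reference, this is roughly how I recreated the index with a single shard - the index name is a placeholder:

    PUT /my-index
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    }

followed by re-posting the same documents.)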

Can you tell us a bit more about your use case? Are you using time-based indices? How large are your shards now that you are using a single primary shard?

No I am not using time-based indices.

I have a few different indices (fewer than 10), each of which originated as a single .json file. Presently the largest one has 20k rows (documents), each with ~2k fields (the same 2000 fields on each document) - that one is ~1.5GB when stored as a .json with one document per line.

The reason I was using 4 shards is that I assumed, from my reading of the docs (correct me if I am mistaken), that one CPU thread can work on a shard, and so 4 shards seemed to be a good match for 4 CPU threads.

Going forward I have one larger index in mind that would be ~70k documents, each with the same 10-12k fields. I'd assume that's going to be 15-20GB as a .json.

That is correct. Are you limited by CPU when you query? Is latency too high when you only use a single core?

Not so far 🙂
I was just trying to eke out as much performance as I could - the low-hanging fruit.

Why did I go with 4 shards at all?
As an aside, when I began exploring Kibana + ES I made an attempt with a 15GB .json - 8 shards, 1 replica, ~70k documents each with ~6,500 (common) fields, spread across two machines (each with 4 cores, 8GB of JVM memory, data on SSD, and ping times of under 20ms between them) - that was not successful. I'd hit a timeout or some other issue when either:

  1. Trying to create an Index in Kibana (after successfully posting to ES). OR
  2. Trying to pick the index under New Visualization.

This made me concerned about performance, and therefore I dropped the number of fields massively and kept 4 shards on the same machine.

Will increasing the number of replicas help ES parallelize its search when the index was created with just one shard?
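(If it would, I assume I could add a replica to the existing index dynamically with something like the following - again with a placeholder index name:

    PUT /my-index/_settings
    {
      "index": { "number_of_replicas": 1 }
    }

but I don't know whether that actually splits the work of a single query or only helps with concurrent queries.)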

Edit:
May I file an issue on GitHub for the ability to override/set the shard_size, to enforce that all samples are read?
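(In a raw query, the terms aggregation already accepts a shard_size parameter that can be set larger than size to trade speed for accuracy - a sketch, again with placeholder names:

    POST /my-index/_search
    {
      "size": 0,
      "aggs": {
        "by_name": {
          "terms": {
            "field": "name",
            "size": 5,
            "shard_size": 10000
          },
          "aggs": {
            "avg_value": { "avg": { "field": "value" } }
          }
        }
      }
    }

What I'm asking for is a way to expose or default this from Kibana on a per-index basis, so my consumers don't have to know about it.)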

Bump

I think this is a question that you might get the best response on from posting a question on the Elasticsearch forum.
