[BUG?] Wrong aggregated values shown in visualization

I seem to have run across what is, for my use case, a deal-breaking bug:
When visualizing the average of a numeric field on the Y-axis against a term field (a name) on the X-axis, the data shown differs substantially depending on whether or not I have filters enabled.

Here are the details:

  • A few thousand (2,000 - 20,000) documents, each with the same set of fields (1,000 - 4,000 fields per document)
  • Plotting a term field (say "name") on the X-axis versus the average of another, numeric field on the Y-axis

Why this is a major bug for me:
If I want to find an outlier (say, the 5 lowest averages) then I cannot do so now, because what I see in the aggregated view often bears no relation to the actual aggregate of the underlying data. The issue is much worse when dealing with percentages.
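For concreteness, the request behind such a visualization would look roughly like the following (the index and field names here are placeholders, not my actual ones):

    POST /my-index/_search
    {
      "size": 0,
      "aggs": {
        "by_name": {
          "terms": {
            "field": "name",
            "size": 5,
            "order": { "avg_value": "asc" }
          },
          "aggs": {
            "avg_value": { "avg": { "field": "value" } }
          }
        }
      }
    }

As far as I understand, the per-term averages returned by a request like this are what end up plotted.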

I have a dataset that can be used to reproduce the bug, and I have made a video showing it in action.

Please let me know if I am doing something incorrectly, or if this issue is supposed to be filed under Elasticsearch.

Some notes:

  1. It is easier to reproduce the bug when you have a large number of documents and fields.

    1. For example, the bug reproduces with 5 X-axis terms when there are ~1000 documents of ~500 fields each
    2. If I drop the document count down to 500, then I often need to drop the X-axis terms to 3 to exhibit the bug
  2. The Elasticsearch response seems to indicate that ALL the documents are hits, but within the average aggregation only a few of them are counted (i.e., the bucket's doc_count does not represent all the available docs for that term)

It might be this issue:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts

I'll look into it more, but that seems like it might be the issue.
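One way to check whether that per-shard term selection is in play (a sketch, reusing the placeholder index and field names from above) would be to ask the terms aggregation to report its error bounds:

    POST /my-index/_search
    {
      "size": 0,
      "aggs": {
        "by_name": {
          "terms": {
            "field": "name",
            "size": 5,
            "show_term_doc_count_error": true
          },
          "aggs": {
            "avg_value": { "avg": { "field": "value" } }
          }
        }
      }
    }

A non-zero doc_count_error_upper_bound on the buckets, or a large sum_other_doc_count, would suggest the per-shard approximation described on that page is what I'm seeing.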

Regards,
Lee


Thanks Lee. My reading of the page you linked to is that this is a performance feature - which is nice and often downright essential.

I'd like to make a case for an option to override this at index-by-index granularity. The thing is, while I generate most of the visualizations, some of the consumers (who are often not savvy) also create their own visualizations - and when they see this unexpected behaviour they are put off and dismiss the tool as unreliable. Having an index-level option would let me ensure that they see what they expect.

EDIT:
Rerunning the dataset with 1 shard seems to make the bug go away (I had been using 4 shards until now) - so it seems likely that what you said is indeed the cause.
But I assume I am leaving a lot of performance on the table, since my system has 4 threads available and a single shard would use only one? Would increasing the number of replicas help?
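(For reference, this is roughly how I recreated the index with a single shard - the index name is a placeholder:

    PUT /my-index
    {
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
      }
    }

followed by re-posting the same documents.)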

Can you tell us a bit more about your use case? Are you using time-based indices? How large are your shards now that you are using a single primary shard?

No I am not using time-based indices.

I have a few different indices (fewer than 10), each of which originated as a single .json file. Presently the largest one has 20k rows (documents), each with ~2k fields (the same 2000 fields on each document) - that one is ~1.5GB when stored as a .json with one document per line.

The reason I was using 4 shards is that I assumed, from my reading of the docs (correct me if I am mistaken), that one CPU thread can work on a shard, and so 4 shards seemed to be a good match for 4 CPU threads.

Going forward I have one larger index in mind that would be ~70k documents, each with the same 10-12k fields. I'd assume that's going to be 15-20GB as a .json.

That is correct. Are you limited by CPU when you query? Is latency too high when you only use a single core?

Not so far 🙂
I was just trying to eke out as much performance as I could - the low-hanging fruit.

Why did I go with 4 shards at all?
As an aside, when I began exploring Kibana + ES I made an attempt with a 15GB .json - 8 shards, 1 replica, ~70k documents each with ~6,500 (common) fields, spread across two machines (each with 4 cores, 8GB of JVM memory, data on SSD, and ping times of under 20ms between them) - that was not successful. I'd hit a timeout or some other issue when either:

  1. Trying to create an Index in Kibana (after successfully posting to ES). OR
  2. Trying to pick the index under New Visualization.

This made me concerned about performance, and therefore I dropped the number of fields massively and kept 4 shards on the same machine.

Will increasing the number of replicas help ES parallelize its search when the index was created with just one shard?
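(If it would, I assume I could add a replica to the existing index dynamically with something like the following - again with a placeholder index name:

    PUT /my-index/_settings
    {
      "index": { "number_of_replicas": 1 }
    }

but I don't know whether that actually splits the work of a single query or only helps with concurrent queries.)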

Edit:
May I file an issue on GitHub for the ability to override/set the shard_size, to enforce that all samples are read?
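(In a raw query, the terms aggregation already accepts a shard_size parameter that can be set larger than size to trade speed for accuracy - a sketch, again with placeholder names:

    POST /my-index/_search
    {
      "size": 0,
      "aggs": {
        "by_name": {
          "terms": {
            "field": "name",
            "size": 5,
            "shard_size": 10000
          },
          "aggs": {
            "avg_value": { "avg": { "field": "value" } }
          }
        }
      }
    }

What I'm asking for is a way to expose or default this from Kibana on a per-index basis, so my consumers don't have to know about it.)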

Bump

I think this is a question that you might get the best response on from posting a question on the Elasticsearch forum.
