Accuracy in ES search

We have a stock of about 2 million documents containing invoice information in an Elastic db.

We want to understand:

  • If I do a count of documents with certain fixed properties (like terms)
  • Or a sum over all those documents

Is the result we get back from Elastic exact? Or should I use the rule of thumb that anything I get out of ES has a statistical margin of error?

Another one I worry about is much more complex:

  • Looking for all companies who sent only one invoice in a specific month. The list of companies should be exact… From what I got from the course, there is always some margin of error…

I’m in doubt and I can’t afford to be :blush:

It depends on which aggregation you use and how you use it. For instance, a terms aggregation returns exact results if you set "size":0 (which returns all keys). The sum aggregation also returns an exact value. Some aggregations, such as cardinality, are approximate by design.
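
To make the count and sum case concrete, here is a minimal Python sketch that only builds the `_search` request body and does not talk to a cluster. The helper name and the field names (`status`, `amount`) are hypothetical. The `track_total_hits` flag is an assumption for Elasticsearch 7 and later, where the reported hit total is otherwise capped at 10,000; drop it on older versions. The top-level `"size": 0` here only suppresses the returned hits and is unrelated to the terms aggregation `size`.

```python
def build_count_and_sum_body(filters, amount_field, track_total_hits=True):
    """Build a _search body that counts matching documents and sums one field.

    filters: dict of field -> exact value, applied as non-scoring term filters.
    amount_field: numeric field to sum (hypothetical name supplied by the caller).
    """
    body = {
        # Return no hits, only the total and the aggregation results.
        "size": 0,
        "query": {
            "bool": {
                "filter": [{"term": {field: value}} for field, value in filters.items()]
            }
        },
        "aggs": {"total_amount": {"sum": {"field": amount_field}}},
    }
    if track_total_hits:
        # Assumption: ES 7+. Asks for the exact hit total instead of a lower bound.
        body["track_total_hits"] = True
    return body


body = build_count_and_sum_body({"status": "paid"}, "amount")
```

The document count would then be read from the hit total of the response and the sum from `aggregations.total_amount.value`.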

For the second question, you can get the exact list of companies with a terms aggregation and "size":0.
For the monthly breakdown, use a date histogram aggregation on the date field, with a terms aggregation on the company as a sub-aggregation.
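
A rough sketch of that nesting, again only building the request body in Python. The helper name and the field names (`invoice_date`, `company_id`) are hypothetical. The `calendar_interval` key is an assumption for recent Elasticsearch versions; older versions use `interval` instead, which is why the key is a parameter. The terms `size` is passed explicitly and should be at least the number of distinct companies for the list to be complete.

```python
def monthly_company_breakdown_body(date_field, company_field, max_companies,
                                   interval_key="calendar_interval"):
    """Build a body with one bucket per month and, inside it, one bucket per company."""
    return {
        "size": 0,  # no hits, aggregations only
        "aggs": {
            "per_month": {
                "date_histogram": {"field": date_field, interval_key: "month"},
                "aggs": {
                    "per_company": {
                        # Explicit upper bound on the number of company buckets.
                        "terms": {"field": company_field, "size": max_companies}
                    }
                },
            }
        },
    }


breakdown = monthly_company_breakdown_body("invoice_date", "company_id", 50000)
```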

Size = 0 won’t work for very high-cardinality terms. You may need to look at the ‘composite’ agg or at partitioning with the terms agg.
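
For the partitioning route, a minimal Python sketch of the request bodies might look like this. The helper name and the `company_id` field are hypothetical; the `include` object with `partition` and `num_partitions` is the terms aggregation's partition filter. The idea is to run one request per partition and choose `num_partitions` and `size` so that every partition's terms fit within `size`.

```python
def partitioned_terms_bodies(field, num_partitions, size):
    """Return one request body per partition of the terms in `field`."""
    bodies = []
    for partition in range(num_partitions):
        bodies.append({
            "size": 0,  # no hits, aggregations only
            "aggs": {
                "companies": {
                    "terms": {
                        "field": field,
                        # Only terms that hash into this partition are considered.
                        "include": {"partition": partition, "num_partitions": num_partitions},
                        "size": size,
                    }
                }
            },
        })
    return bodies


bodies = partitioned_terms_bodies("company_id", num_partitions=20, size=10000)
```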

Also, I found that "size":0 on terms is no longer supported from ES 5.0 onward. You have to specify the maximum size explicitly, if you know it.

Would using the other techniques give me 100% certainty that I got all the documents? Or would there still be a margin of error?

Thank you both for your answers!

Composite aggregation or partitioning with the terms aggregation can give you 100% accurate values if you use them correctly. Refer to their documentation for implementation details. Also, note that composite aggregation is a new feature in Elasticsearch 6 and is still in beta.
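
As a sketch of what "using it correctly" could mean for the one-invoice question: page through every composite bucket and keep the companies whose `doc_count` is 1. This assumes one document per invoice and that the query already restricts to the month of interest. The function names, the `company_id` field, and the `search` callable (anything that takes a request body and returns the parsed response) are hypothetical stand-ins for your own client code.

```python
def composite_body(query, company_field, after=None, page_size=1000):
    """Build one page of a composite aggregation keyed by company."""
    composite = {
        "size": page_size,
        "sources": [{"company": {"terms": {"field": company_field}}}],
    }
    if after is not None:
        composite["after"] = after  # resume after the last key of the previous page
    return {"size": 0, "query": query, "aggs": {"by_company": {"composite": composite}}}


def companies_with_one_invoice(search, query, company_field):
    """Collect companies that have exactly one matching document."""
    result, after = [], None
    while True:
        agg = search(composite_body(query, company_field, after))["aggregations"]["by_company"]
        buckets = agg["buckets"]
        if not buckets:
            break  # an empty page means every bucket has been seen
        result.extend(b["key"]["company"] for b in buckets if b["doc_count"] == 1)
        # Use after_key when the response has it, else the last bucket's key.
        after = agg.get("after_key", buckets[-1]["key"])
    return result


# Usage with a fake `search` that replays canned pages instead of calling a cluster.
pages = [
    {"aggregations": {"by_company": {"after_key": {"company": "b"}, "buckets": [
        {"key": {"company": "a"}, "doc_count": 1},
        {"key": {"company": "b"}, "doc_count": 3}]}}},
    {"aggregations": {"by_company": {"buckets": [
        {"key": {"company": "c"}, "doc_count": 1}]}}},
    {"aggregations": {"by_company": {"buckets": []}}},
]
calls = []

def fake_search(body):
    calls.append(body)
    return pages[len(calls) - 1]

single_invoice_companies = companies_with_one_invoice(fake_search, {"match_all": {}}, "company_id")
```

Because every bucket is visited rather than only the top N, nothing is dropped by a size cutoff, at the cost of several round trips.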

If you are using an older version of Elasticsearch (before 5.0), you can use "size":0, provided your term count is less than Integer.MAX_VALUE.
