We have a stock of about 2 million documents containing invoice information in an Elastic db.
We want to understand the following: if I do a count of documents with certain fixed properties (such as terms), or a sum over all those documents, is the result we get back from Elasticsearch exact? Or should I use the rule of thumb that anything I get out of ES has a statistical margin of error?
Another case I worry about is much more complex: finding all companies that sent only one invoice in a specific month. The list of companies should be exact. From what I gathered from the course, there is always some margin of error.
It depends on which type of aggregation you use and how you use it. For instance, a terms aggregation gives you exact results if you set "size":0 (which means include all the keys). The sum aggregation also gives you an exact value. Certain aggregations, such as cardinality, are based on approximation, though.
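To make the distinction concrete, here is a minimal sketch of a request body that counts and sums the documents matching some fixed terms, with a cardinality aggregation added for contrast. The function name and the field names (`status`, `currency`, `amount`, `company_id`) are made up for illustration; only the query DSL keys are Elasticsearch's own.

```python
def build_count_and_sum_body(term_filters, amount_field, company_field):
    """Build a search body that counts and sums documents matching exact terms.

    All field names passed in are hypothetical; adapt them to your mapping.
    """
    return {
        # Top-level "size": 0 only suppresses the hits list.
        # It is unrelated to the "size" setting of a terms aggregation.
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}}
                    for field, value in term_filters.items()
                ]
            }
        },
        "aggs": {
            # Exact: number of values of the amount field in matching docs.
            "invoice_count": {"value_count": {"field": amount_field}},
            # Exact, apart from ordinary floating-point rounding.
            "invoice_total": {"sum": {"field": amount_field}},
            # Approximate: cardinality is an estimate of the distinct count.
            "distinct_companies": {"cardinality": {"field": company_field}},
        },
    }


body = build_count_and_sum_body(
    {"status": "paid", "currency": "EUR"}, "amount", "company_id"
)
```

The resulting dict can be sent as the body of a `_search` request with whatever client you already use.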
For the second question, you can get the exact list of companies by running a terms aggregation on the company field with "size":0.
For the monthly breakdown, you can use a date histogram aggregation on the date field, with a terms aggregation on the company as a sub-aggregation.
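A rough sketch of that monthly breakdown, assuming hypothetical field names (`invoice_date`, `company_id`) and an explicit `max_companies` that you know is at least the number of distinct companies. The second function reads the response and keeps the companies with exactly one invoice in a given month.

```python
def build_monthly_company_body(date_field, company_field, max_companies):
    """Date histogram per month with a terms sub-aggregation per company."""
    return {
        "size": 0,  # no hits needed, only aggregations
        "aggs": {
            "per_month": {
                "date_histogram": {
                    "field": date_field,
                    # Newer Elasticsearch versions use "calendar_interval".
                    "interval": "month",
                    "format": "yyyy-MM",
                },
                "aggs": {
                    "per_company": {
                        "terms": {"field": company_field, "size": max_companies}
                    }
                },
            }
        },
    }


def companies_with_single_invoice(response, month_key):
    """Return companies whose bucket has doc_count == 1 in the given month."""
    for month in response["aggregations"]["per_month"]["buckets"]:
        if month["key_as_string"] != month_key:
            continue
        companies = month["per_company"]
        # A non-zero remainder means some companies did not fit into "size",
        # so the list would be incomplete.
        if companies.get("sum_other_doc_count", 0) > 0:
            raise ValueError("terms size too small; result would be incomplete")
        return [b["key"] for b in companies["buckets"] if b["doc_count"] == 1]
    return []
```

The check on `sum_other_doc_count` is a cheap guard: if it is above zero, the chosen size did not cover every company and the list should not be trusted.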
Also, I found that "size":0 on the terms aggregation is no longer supported from ES 5.0 onward. You have to specify the maximum size explicitly, if you have an idea of what it is.
The composite aggregation, or partitioning with the terms aggregation, can give you a 100% accurate result if you use them correctly. Refer to their documentation for implementation details. Also note that the composite aggregation is a new feature in Elasticsearch 6 and is still in beta.
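As an illustration of the composite approach, here is a sketch that pages through every company for one month and keeps those with a single invoice. The `search` argument is a hypothetical callable (request body in, response dict out) standing in for whatever client you use, and the field names and dates are assumptions.

```python
def single_invoice_companies(search, company_field, date_field,
                             start, end, page_size=1000):
    """Page through a composite aggregation and collect doc_count == 1 keys.

    `search` is a hypothetical callable: it takes a request body dict and
    returns the parsed search response as a dict.
    """
    singles = []
    after_key = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"company": {"terms": {"field": company_field}}}],
        }
        if after_key is not None:
            composite["after"] = after_key  # resume after the previous page
        body = {
            "size": 0,
            # Restrict to the month of interest: start inclusive, end exclusive.
            "query": {"range": {date_field: {"gte": start, "lt": end}}},
            "aggs": {"companies": {"composite": composite}},
        }
        result = search(body)["aggregations"]["companies"]
        buckets = result["buckets"]
        if not buckets:
            break  # an empty page means every company has been seen
        singles.extend(
            b["key"]["company"] for b in buckets if b["doc_count"] == 1
        )
        # Use "after_key" when the response provides it, otherwise fall back
        # to the key of the last bucket on the page.
        after_key = result.get("after_key") or buckets[-1]["key"]
    return singles
```

Because every bucket is visited once, nothing is cut off by a size limit; the cost is one request per page.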
If you are using an older version of Elasticsearch (before 5.0), you can use "size":0, provided the number of terms is less than Integer.MAX_VALUE.