I have an index that can grow to millions of documents over time, and I need to run several aggregations on their fields. In particular, I need to count the number of unique values of a field, and that count can itself grow past a million as documents accumulate. I know about the "cardinality" aggregation, but it doesn't give the exact count, only an approximation. So I tried scripted_metric instead.
But that gives me the right value only up to about 1 million documents. Once the document count grows beyond that, it throws an error.
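For reference, my scripted_metric looked roughly like this (index and field names are illustrative). Collecting every distinct value into an in-memory set on each shard is presumably what blows up once the totals get large:

```json
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "exact_distinct": {
      "scripted_metric": {
        "init_script": "state.seen = new HashSet()",
        "map_script": "state.seen.add(doc['my_field'].value)",
        "combine_script": "return state.seen",
        "reduce_script": "Set all = new HashSet(); for (s in states) { if (s != null) { all.addAll(s); } } return all.size();"
      }
    }
  }
}
```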
So what is the right way to run an aggregation like cardinality or sum over millions of documents and get the exact result, not an approximation?
As a side note, I also tried splitting my index by time, but scripted_metric still errors once the total across all indices exceeds 1 million documents.
This is what we call the Fast-Accurate-Big trilemma.
You can only have two of the three.
Elasticsearch aggregations are optimised for Fast and Big, so they make a single pass over distributed data using approximation algorithms like the one behind the cardinality agg. Accuracy is the trade-off.
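The cardinality agg does expose a `precision_threshold` setting that trades memory for accuracy, but it is capped at 40,000, so counts above that are always approximate. A sketch (field name illustrative):

```json
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "my_field",
        "precision_threshold": 40000
      }
    }
  }
}
```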
Alternatively, you can make multiple calls to page through a composite aggregation and count the buckets in your client. That is BA - working with Big data and Accurate results, but not especially Fast.
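A sketch of that approach (field name illustrative): page using the `after` key and tally the buckets client-side until the response no longer returns an `after_key`:

```json
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_vals": {
      "composite": {
        "size": 10000,
        "sources": [
          { "val": { "terms": { "field": "my_field" } } }
        ]
      }
    }
  }
}
```

Each response includes `aggregations.unique_vals.after_key`; pass it back as `"after": { ... }` in the next request. The total number of buckets across all pages is your exact distinct count.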
I don’t understand. The cardinality agg doesn’t allow sub aggregations but composite does.
I suggested making multiple calls with the composite agg and its paging facilities, so you can count the unique values in your client, as opposed to making a single request with the cardinality aggregation.
What is the business problem that requires 100% accuracy? Sometimes it takes so long to calculate accurate values that the result is out of date (and therefore inaccurate) by the time you get it.
In this query (sketched at the end of this post), the "distinct_values" aggregation should return the number of unique values of field3, and that number can itself run into the millions. A composite aggregation cannot be used under a terms aggregation - when I tried it, I got this error:
"reason": "[composite] aggregation cannot be used with a parent aggregation of type: [TermsAggregatorFactory]",
I need a suggestion for how to get an accurate unique-value count for field3 inside a sub-aggregation when there may be millions of documents.
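The query is shaped roughly like this (the parent grouping field is illustrative; field3 is real):

```json
POST my-index/_search
{
  "size": 0,
  "aggs": {
    "by_group": {
      "terms": { "field": "group_field" },
      "aggs": {
        "distinct_values": {
          "cardinality": { "field": "field3" }
        }
      }
    }
  }
}
```

Swapping the cardinality sub-agg for a composite agg is what produces the error above.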
Ah, got it. Composite can't be used as a sub-agg, rather than being unable to have sub-aggs of its own.
That is a problem. The physics challenge is trying to do this computation efficiently and accurately if you have multiple distributed shards/indexes.
Could you live with a policy where the cardinality agg is used to identify candidates using mostly-accurate measures, and you then drill down on selected items of interest to get exact numbers - using a query scoped to the subject of interest and a composite agg at the root to do the calculation I suggested, or even a scripted agg if the cardinality is low?
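The drill-down step might look like this (names illustrative): the query narrows the data to one item of interest, which lets the composite agg sit at the root and be paged as before:

```json
POST my-index/_search
{
  "size": 0,
  "query": {
    "term": { "group_field": "item_of_interest" }
  },
  "aggs": {
    "unique_field3": {
      "composite": {
        "size": 10000,
        "sources": [
          { "val": { "terms": { "field": "field3" } } }
        ]
      }
    }
  }
}
```

The bucket count across all pages is then the exact distinct count for that one item.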
Again, what is the business problem you hope to solve?