Thanks in advance to everyone who reads this.
Background: I have a job that inserts a document when it executes. The job can fail and run again, inserting a second document. Both documents share the same linkId, so I know they belong to a single job. I need a count of ALL unique linkIds, no matter how many there are.
I first looked at the cardinality aggregation, but it is approximate: its precision_threshold setting caps at 40,000, and beyond that the count can become inaccurate.
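For reference, this is roughly what I tried; even with the threshold raised to its maximum, my understanding is the count is still only approximate once the number of distinct linkIds passes 40,000 (the index name here is just a placeholder):

POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "distinct_attempts": {
      "cardinality": {
        "field": "linkId",
        "precision_threshold": 40000
      }
    }
  }
}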
Then I looked into the cumulative_cardinality pipeline aggregation, but that strategy is built on cardinality underneath, and the documentation makes no mention of whether the same inaccuracy applies. Is this method accurate for large numbers?
My aggs section looks like this:
"aggs": {
"unique_attempts": {
"date_histogram": {
"field": "timestamp",
"calendar_interval": "day"
},
"aggs": {
"distinct_attempts": {
"cardinality": {
"field": "linkId"
}
},
"total_attempts": {
"cumulative_cardinality": {
"buckets_path": "distinct_attempts"
}
}
}
}
}
So will this approach give me an accurate count of unique linkIds? Or do I need to make multiple calls to the API and join the results myself? If the API route is the answer, what strategy would be best? One idea I had is sketched below.
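For the multiple-call route, the only strategy I could think of (just a sketch; index name and page size are placeholders) is to page through a composite aggregation on linkId and count the buckets client-side:

GET /my-index/_search
{
  "size": 0,
  "aggs": {
    "unique_links": {
      "composite": {
        "size": 10000,
        "sources": [
          { "link": { "terms": { "field": "linkId" } } }
        ]
      }
    }
  }
}

Each response includes an after_key; passing it back as "after" in the next request pages through all distinct linkIds, and the total number of buckets across all pages would be the exact count. But that could mean a lot of round trips, so I'm not sure it's the right strategy.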
Many thanks; I know this is not an easy question.