How can I get an accurate unique count from Elasticsearch?


(Lyncean Patel) #1

I also used {"precision_threshold": 1000} but still don't get the correct count.


#2

The maximum precision_threshold supported by Elasticsearch is 40000. Try it. Even then, the cardinality aggregation is approximate, so it sometimes doesn't match the exact count you would get from other, more expensive queries.


(Lyncean Patel) #3

I tried it, but I still can't get the correct result.


(Mark Harwood) #4

What is the expected count and the returned count in this case?


(Lyncean Patel) #5

Expected: 3921
Actual: 3671

I have visit documents, each with a visit_id and a visitor_id, and I am trying to find the number of distinct visitors.
So for 5000 visits (count of visit_id), the number of visitors should be 3921 (unique count of visitor_id).
I am using KIBI to display the data.


(Mark Harwood) #6

OK. And what were the numbers when you tried the max precision threshold of 40,000?


(Lyncean Patel) #7

without precision_threshold = 3689
{"precision_threshold": 1000} = 3698
{"precision_threshold": 40000} = 3703


(Mark Harwood) #8

This smells fishy. Even in safe mode (where the actual number of distinct values is <= precision_threshold) there is a small possibility of error, because the implementation counts unique hash codes. Hash collisions are rare but possible, which can cause a small amount of undercounting, but the scale of the discrepancy between your expected and returned values (3921 vs 3703) seems off.
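To put a rough number on "rare but possible", here is a back-of-the-envelope sketch using the birthday-bound approximation. The 32-bit hash width is an assumption chosen for illustration only, not a statement about what Elasticsearch uses internally, and `expected_collisions` is a hypothetical helper name.

```python
def expected_collisions(n_distinct: int, hash_bits: int) -> float:
    """Birthday-bound approximation: expected number of colliding pairs
    when n_distinct values are hashed uniformly into 2**hash_bits slots."""
    pairs = n_distinct * (n_distinct - 1) / 2
    return pairs / 2 ** hash_bits


# hash_bits=32 is an illustrative assumption, not the documented hash width.
estimate = expected_collisions(3921, 32)
shortfall = 3921 - 3703  # expected count minus count at precision_threshold 40000

print(f"expected colliding pairs: {estimate:.6f}")
print(f"observed shortfall: {shortfall}")
```

Under that assumption the expected number of collisions is far below one, while the observed shortfall is 218, so collisions alone would not plausibly account for the gap.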

Can you share more about your query and mappings? How many buckets do you get if you use a terms agg on the visitor_id field with size set to 40,000? Are you sure the expected number of 3,921 is correct? I appreciate some of the data may be too large or sensitive to share publicly, so feel free to message me directly with zipped results if that helps.
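A minimal sketch of that terms agg check, written as a Python dict so the bucket count can be read back from the response. The aggregation name `visitors`, the helper names, and the sample response are made up for illustration; the field name `visitor_id` follows the wording of the question and may differ from the actual mapping. For a fair comparison, the same query/filter used with the cardinality agg would need to be added to the body.

```python
import json

# Assumption: the field is called "visitor_id" as in the question text;
# adjust it to whatever the mapping really uses.
FIELD = "visitor_id"


def build_terms_request(field: str, size: int = 40000) -> dict:
    """Request body with a terms agg that returns up to `size` buckets."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "aggs": {
            "visitors": {"terms": {"field": field, "size": size}}
        },
    }


def count_buckets(response: dict, agg_name: str = "visitors") -> int:
    """Number of distinct terms returned by the terms agg."""
    return len(response["aggregations"][agg_name]["buckets"])


body = build_terms_request(FIELD)
print(json.dumps(body, indent=2))

# Hand-made sample response, only to show the shape being read.
sample_response = {
    "aggregations": {
        "visitors": {
            "buckets": [
                {"key": "a", "doc_count": 2},
                {"key": "b", "doc_count": 1},
            ]
        }
    }
}
print(count_buckets(sample_response))
```

If the number of buckets differs from the expected 3,921, that would point at the data or the filter instead of the cardinality approximation.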

Cheers
Mark


(Lyncean Patel) #9

Thanks @Mark_Harwood for the reply.
I will recheck the data integrity.
Also, I can't share the actual data with you, so I will create similar dummy data and try to reproduce the issue. If it reproduces, I will share the test data with you.

I also forgot to mention that I first filter the results by a flag value.

Elasticsearch request body:
{
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "query": "booked_by_app:Yes",
          "analyze_wildcard": true
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "booking_date": {
                  "gte": 1477324276000,
                  "lte": 1493049076000,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "2": {
      "cardinality": {
        "field": "visitor",
        "precision_threshold": 1000
      }
    }
  }
}

