Nested cardinality values way off with filter?

Phil_Price · May 22, 2014, 8:34pm

Hello,

I'm trying to produce the distribution of documents that match vs
don't match a query, and get the cardinality of a field for both sets. The
idea is "Users who did" vs "Users who did not". In reality I'm actually
running another aggregation under "did not" (otherwise I'd just subtract
one count from the total), but the query here illustrates the issue I'm
having:

Query

"aggs": {
    "total_distinct_count": { "cardinality": { "field": "UserId" } },
    "has_thing": {
        "filter": { "term": { "State": "thing" } },
        "aggs": {
            "distinct_count": { "cardinality": { "field": "UserId" } }
        }
    },
    "does_not_have_thing": {
        "filter": { 
            "not" : { "term": { "State": "thing" } }
        },
        "aggs": { 
            "distinct_count": { "cardinality": { "field": "UserId" } }
        }    
    }
}

Response

"hits": {
    "total": 3309709,
    "max_score": 0,
    "hits": []
},
"aggregations": {
    "total_distinct_count": {
        "value": 654556
    },
    "does_not_have_thing": {
        "doc_count": 2575512,
        "distinct_count": {
            "value": 563371
        }
    },
    "has_thing": {
        "doc_count": 734197,
        "distinct_count": {
            "value": 223128
        }
    }
}

I would expect (aggregations.has_thing.distinct_count.value +
aggregations.does_not_have_thing.distinct_count.value) to be close to
aggregations.total_distinct_count.value, but in reality it's pretty far off
(~+20%). Note that the doc_count values add up exactly to hits.total, so I
don't think this is an issue with the query, but I could be wrong.

Any ideas what's up? Have I structured the query incorrectly? Is this a bug?
Or is this just expected behavior?

Some notes:

UserId's data type is long, but the values only fill up integer
space (510,539 to 418,346,844).
I'm running elasticsearch 1.1.0.
I've tried playing around with the precision threshold, but it doesn't
appear to make a difference.

Thanks in advance,
Cheers
Phil

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cb558261-7865-491e-9bc5-e3f78b6390f3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

jpountz · May 22, 2014, 11:36pm

On Thu, May 22, 2014 at 10:34 PM, Phil Price <philprice@gmail.com> wrote:

I would expect (aggregations.has_thing.distinct_count.value +
aggregations.does_not_have_thing.distinct_count.value) to be close to
aggregations.total_distinct_count.value, but in reality it's pretty far off

I think this result is to be expected if some user IDs match
both criteria. E.g., if your index has these two documents:

{
    "UserId": 42,
    "State": "thing"
}

{
    "UserId": 42,
    "State": "anything"
}

Then your aggregations would look like:

"aggregations": {
    "total_distinct_count": {
        "value": 1
    },
    "does_not_have_thing": {
        "doc_count": 1,
        "distinct_count": {
            "value": 1
        }
    },
    "has_thing": {
        "doc_count": 1,
        "distinct_count": {
            "value": 1
        }
    }
}

And the sum of the values of distinct_count per bucket is larger than the
global value for distinct_count.
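
To make the overlap concrete, here is a minimal Python sketch that is not part of the original thread. It counts distinct users with exact sets, whereas the cardinality aggregation only approximates these counts. The sample documents are hypothetical: the two documents above plus one extra user (UserId 7) so that the totals differ visibly.

```python
# Hypothetical sample data: the two documents from the example above,
# plus one extra user so that the effect is easier to see.
docs = [
    {"UserId": 42, "State": "thing"},
    {"UserId": 42, "State": "anything"},
    {"UserId": 7, "State": "anything"},
]


def distinct_users(documents):
    """Exact distinct UserId count (the cardinality aggregation approximates this)."""
    return len({d["UserId"] for d in documents})


# The two filters split the documents into disjoint buckets...
has_thing = [d for d in docs if d["State"] == "thing"]
does_not_have_thing = [d for d in docs if d["State"] != "thing"]

total_distinct = distinct_users(docs)
has_distinct = distinct_users(has_thing)
not_distinct = distinct_users(does_not_have_thing)

# ...so the doc counts add up to the total number of documents,
assert len(has_thing) + len(does_not_have_thing) == len(docs)

# but user 42 appears in both buckets, so the distinct counts do not add up.
print(total_distinct, has_distinct, not_distinct)  # 2 1 2
```

The filters partition documents, not users, so a user is counted once in every bucket that contains at least one of their documents.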

--
Adrien Grand


Phil_Price · May 22, 2014, 11:48pm

Doh! You are correct, my bad. I assumed the filter was an exclusive "per
user" property, but in fact it is not.
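
As a rough check (not from the original posts), the numbers in the response above can be plugged into inclusion-exclusion to estimate how many users fall into both buckets. This sketch assumes the reported cardinality values are close to exact; they are HyperLogLog++ estimates, so the result is itself only approximate.

```python
# Values copied from the response earlier in the thread.
total_distinct = 654556
has_thing_distinct = 223128
does_not_have_thing_distinct = 563371

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
bucket_sum = has_thing_distinct + does_not_have_thing_distinct
overlap = bucket_sum - total_distinct

# How far the naive sum overshoots the total, as a fraction.
excess = bucket_sum / total_distinct - 1

print(bucket_sum)  # 786499
print(overlap)     # 131943
print(f"{excess:.1%}")
```

Under that assumption, roughly 132,000 users have documents in both buckets, which accounts for the ~20% excess.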

Thanks for getting back to me
Cheers
Phil

