Nested cardinality values way off with filter?


(Phil Price) #1

Hello,

I'm trying to produce the distribution of documents that match vs.
don't match a query, and get the cardinality of a field for both sets. The
idea is "Users who did" vs. "Users who did not". In reality I'm actually
running another aggregation under "did not" (otherwise I'd just subtract
one count from the total), but the query here illustrates the issue I'm
having:

Query

"aggs": {
    "total_distinct_count": { "cardinality": { "field": "UserId" } },
    "has_thing": {
        "filter": { "term": { "State": "thing" } },
        "aggs": {
            "distinct_count": { "cardinality": { "field": "UserId" } }
        }
    },
    "does_not_have_thing": {
        "filter": { 
            "not" : { "term": { "State": "thing" } }
        },
        "aggs": { 
            "distinct_count": { "cardinality": { "field": "UserId" } }
        }    
    }
}

Response

"hits": {
    "total": 3309709,
    "max_score": 0,
    "hits": []
},
"aggregations": {
    "total_distinct_count": {
        "value": 654556
    },
    "does_not_have_thing": {
        "doc_count": 2575512,
        "distinct_count": {
            "value": 563371
        }
    },
    "has_thing": {
        "doc_count": 734197,
        "distinct_count": {
            "value": 223128
        }
    }
}

I would expect (aggregations.has_thing.distinct_count.value +
aggregations.does_not_have_thing.distinct_count.value) to be close to
aggregations.total_distinct_count.value, but in reality the sum is pretty
far off (about 20% higher). Note that the two doc_count values add up
exactly to hits.total, so I don't think this is an issue with the query,
but I could be wrong.

Any ideas what's up? Have I structured the query incorrectly? Is this a bug?
Or is this just expected behavior?

Some notes:

  • UserId's data type is a long, but the values only fill up integer
    space (510,539 to 418,346,844).
  • I'm running elasticsearch 1.1.0
  • I've tried playing around with the precision threshold, but it doesn't
    appear to make a difference.

Thanks in advance,
Cheers
Phil

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/cb558261-7865-491e-9bc5-e3f78b6390f3%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Adrien Grand) #2

On Thu, May 22, 2014 at 10:34 PM, Phil Price <philprice@gmail.com> wrote:

I would expect (aggregations.has_thing.distinct_count.value +
aggregations.does_not_have_thing.distinct_count.value) to be close to
aggregations.total_distinct_count.value, but in reality it's pretty far off

I think this result is to be expected if some user IDs match both
criteria. For example, if your index has these two documents:

{
    "UserId": 42,
    "State": "thing"
}

{
    "UserId": 42,
    "State": "anything"
}

Then your aggregations would look like:

"aggregations": {
    "total_distinct_count": {
        "value": 1
    },
    "does_not_have_thing": {
        "doc_count": 1,
        "distinct_count": {
            "value": 1
        }
    },
    "has_thing": {
        "doc_count": 1,
        "distinct_count": {
            "value": 1
        }
    }
}

And the sum of the values of distinct_count per bucket is larger than the
global value for distinct_count.
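
To make the two-document example concrete, here is a minimal Python sketch that mimics the three aggregations with plain sets. It is only an illustration: the real cardinality aggregation is approximate, while this uses exact counts, and the helper name distinct_users is made up for the example.

```python
# Toy stand-in for the index: the two documents from the example above.
docs = [
    {"UserId": 42, "State": "thing"},
    {"UserId": 42, "State": "anything"},
]


def distinct_users(documents):
    """Exact distinct count of UserId (hypothetical helper, not an ES API)."""
    return len({d["UserId"] for d in documents})


# The two filter buckets partition the documents, not the users.
has_thing = [d for d in docs if d["State"] == "thing"]
does_not_have_thing = [d for d in docs if d["State"] != "thing"]

total = distinct_users(docs)
per_bucket_sum = distinct_users(has_thing) + distinct_users(does_not_have_thing)

print(total, per_bucket_sum)  # 1 2
```

The doc counts of the two buckets still add up to the total number of documents, because each document falls in exactly one bucket. User 42 has documents in both buckets, so that user is counted once per bucket.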

--
Adrien Grand



(Phil Price) #3

Doh! You are correct, my bad. I assumed the filter was an exclusive "per
user" property, but in fact it is not.
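
For reference, a rough sketch of what the numbers in the original response imply under that explanation. By inclusion-exclusion, the sum of the two bucket counts minus the global count estimates how many users appear in both buckets. The variable names are made up, and since the cardinality aggregation returns approximate values, the result is an estimate only.

```python
# Values copied from the response in the first post.
total_distinct = 654556
has_thing_distinct = 223128
does_not_have_thing_distinct = 563371

# |A| + |B| counts users present in both buckets twice.
bucket_sum = has_thing_distinct + does_not_have_thing_distinct

# Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|
overlap = bucket_sum - total_distinct

# Relative excess of the summed counts over the global count, in percent.
excess_pct = round((bucket_sum / total_distinct - 1) * 100, 1)

print(bucket_sum, overlap, excess_pct)  # 786499 131943 20.2
```

So roughly 132,000 users have documents on both sides of the filter, which accounts for the ~20% gap.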

Thanks for getting back to me
Cheers
Phil

On Thursday, May 22, 2014 4:36:02 PM UTC-7, Adrien Grand wrote:

And the sum of the values of distinct_count per bucket is larger than the
global value for distinct_count.


