Confused about interaction between unique count and buckets

Let's say I set my time range to 2013-01-01 00:00:00.000 through 2013-01-31 23:59:59.999 (the entire month of January 2013). Then I create a line graph visualization and keep the default y-axis setup, which is a Count of all documents in my index in that time range; it produces 1,565 hits (just a single dot at 1,565, since I haven't defined an x-axis).

One of the fields in my documents is user_id. If I then change the y-axis to do a Count of user_id in that same time period, I get 1,565 hits. So far, nothing strange - every document has a user_id, as I expect.

When I then change the y-axis to a Unique Count of user_id instead, I get 488. So far, so good - 488 unique users this month.

If I then add a bucket - it doesn't matter what kind; it can be the x-axis or split lines - as a terms bucket on the scripted field 'year', which is defined as doc['@timestamp'].getYear(), the unique count goes to 520.

I don't understand this - where are the extra unique user_ids coming from? How can I have 488 unique user_ids in the time period Jan 2013, but 520 unique user_ids in the same time period when I add year on the x-axis? "Unique user_ids by year" should not differ from "unique user_ids" when the time period is just Jan 2013!

This only seems to affect Unique Count. If I do the original plain count (1,565) and then add the 'year' sub-bucket, I still get 1,565, which is what I expect.

So what am I not understanding about the interplay between Unique Count and buckets?

Did you arrive at 520 by adding up the number of unique user_ids by year?

If so, then this discrepancy is probably caused by user_ids that show up in multiple years. The unique count of user_ids is calculated within each year bucket, and only for that year, so if user:1 shows up in both 2012 and 2013 it will essentially be counted twice.
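To illustrate with made-up numbers (these buckets are hypothetical, not from your data): over a two-year time range, the response could look like the snippet below, and any user_id active in both years contributes to both per-bucket unique counts.

{
  "aggregations": {
    "2": {
      "buckets": [
        {
          "1": { "value": 300 },
          "key": 2012,
          "doc_count": 900
        },
        {
          "1": { "value": 488 },
          "key": 2013,
          "doc_count": 1565
        }
      ]
    }
  }
}

Adding the per-bucket values gives 300 + 488 = 788, but if, say, 50 users were active in both years, the overall unique count across the whole range would only be 738.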

But my time range was restricted to Jan 2013, so documents from other years should not have been counted at all, right?

I guess I didn't directly answer your question - since the time range was restricted to Jan 2013, there was nothing to add up, because the graph of unique user_id by year only had one data point, at 520 on the y-axis and 2013 on the x-axis.

Thanks,
Fran

Hi, could you share the Elasticsearch request and response pairs from before and after you add the bucket on the scripted field 'year'? To do this, you can pull up the spy panel for the visualization (once before and once after adding the bucket) by clicking the little ^ at the bottom of the chart, then click the Request and Response buttons. Thanks!

Request and response with no bucket:

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "analyze_wildcard": true,
          "query": "*"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": 1357020000000,
                  "lte": 1359698399999,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "1": {
      "cardinality": {
        "field": "user_id"
      }
    }
  }
}

{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "hits": {
    "total": 1565,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "1": {
      "value": 488
    }
  }
}

Request and response after adding year bucket:

{
  "size": 0,
  "query": {
    "filtered": {
      "query": {
        "query_string": {
          "analyze_wildcard": true,
          "query": "*"
        }
      },
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "@timestamp": {
                  "gte": 1357020000000,
                  "lte": 1359698399999,
                  "format": "epoch_millis"
                }
              }
            }
          ],
          "must_not": []
        }
      }
    }
  },
  "aggs": {
    "2": {
      "terms": {
        "script": "doc['@timestamp'].getYear()",
        "lang": "expression",
        "size": 1,
        "order": {
          "_term": "asc"
        },
        "valueType": "float"
      },
      "aggs": {
        "1": {
          "cardinality": {
            "field": "user_id"
          }
        }
      }
    }
  }
}

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "hits": {
    "total": 1565,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "2": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "1": {
            "value": 520
          },
          "key": 2013,
          "doc_count": 1565
        }
      ]
    }
  }
}

Yeah, I'm not sure why there's this discrepancy. I'm moving this post to the Elasticsearch category hoping someone there can explain what's going on here.

The cardinality aggregation calculates an approximate value in order to scale to large numbers of shards and high volumes of data, so both 488 and 520 are estimates of the same true unique count rather than exact figures, and the estimate can come out differently depending on the aggregation context it is computed in. You can experiment with the precision_threshold parameter to increase accuracy. This can be specified as additional JSON in Kibana visualisations.
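For concreteness, here is a minimal sketch of that, assuming the Unique Count metric is the aggregation labelled "1" in the requests above: open the Advanced section for that metric in the visualization editor and put the following into its JSON Input field. The value 40000 is just an illustrative ceiling (it is the maximum the parameter supports); anything comfortably above the expected unique count should work.

{
  "precision_threshold": 40000
}

Kibana should then merge this into the generated aggregation, so the request ends up containing:

"aggs": {
  "1": {
    "cardinality": {
      "field": "user_id",
      "precision_threshold": 40000
    }
  }
}

Counts below the configured threshold are expected to be close to exact, at the cost of some extra memory per bucket.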