Cardinality Aggregation - Different Unique Counts!

(Herick) #1

Hey guys,

I am still trying to figure this out. I have two Elasticsearch queries, and the unique count on one is 112019 while the other returns 164322. I have pasted the requests and responses on pastebin.

The actual document counts match for both, but the unique counts do not.

FYI, the ip_address field type is string; I'm storing IPs exactly as I receive them from the client.

I just can't understand why they come out different, and honestly I'd like to know which one is right.

Can anyone help?

Thanks so much

(Jörg Prante) #2

Maybe it is a surprise to you, but the cardinality aggregation's unique count is only an approximate count, computed with the HyperLogLog++ algorithm.
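
The approximation can be illustrated with a minimal HyperLogLog in plain Python. This is a rough sketch only: the real HyperLogLog++ used by Elasticsearch adds bias correction and a sparse encoding on top of this idea.

```python
import hashlib
import math

def hll_estimate(items, p=10):
    """Rough HyperLogLog sketch; not the full HLL++ used by Elasticsearch."""
    m = 1 << p                 # number of registers (2^p)
    registers = [0] * m
    for item in items:
        # 64-bit hash: the first p bits pick a register, the rest feed the rank
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h >> (64 - p)
        rest = h & ((1 << (64 - p)) - 1)
        rank = (64 - p) - rest.bit_length() + 1   # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    if raw <= 2.5 * m:         # small-range correction: linear counting
        zeros = registers.count(0)
        if zeros:
            raw = m * math.log(m / zeros)
    return raw

print(f"exact=10000 approx={hll_estimate(range(10000)):.0f}")
```

With p=10 (1024 registers) the typical relative error is around 1.04/sqrt(1024), roughly 3%, which is why two differently bucketed queries over the same data can disagree.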

(Herick) #3

HI jprante, thanks for your reply!

Yes, I did know the unique count is only an estimate. What I can't understand is why the two queries I pasted on pastebin give completely different results for the unique count.

I would like to know what I'm doing wrong, since they both should match.

Thanks again!

(Jörg Prante) #4

You are not doing anything wrong. In the first query, the cardinality approximation runs once over the whole result set of the query. In the second query, you run a date histogram and run a cardinality approximation on each histogram bucket. So the approximations of the cardinality differ between the two queries.

(Herick) #5

Thanks again. I see what you are saying now.

My concern is, how accurate are they?

I am using my results to compare data against traffic purchased from ad networks. The unique count is matched against what we purchase at the other end, and our analytics team pores over these numbers to make sure what we get is what we purchased, hence the need for accuracy. We really love Elasticsearch and its speed; the only thing of concern is the unique count of IP addresses!

Would the field type affect accuracy? At the moment we store IP addresses in their "native" dotted format (field type = string). Would it make any difference to store them as a number instead, by converting each IP to a long (e.g. using PHP's ip2long())?
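
As an aside on ip2long(): it maps a dotted-quad string to a 32-bit integer. A Python equivalent using the stdlib ipaddress module (192.0.2.1 is an RFC 5737 documentation address, used here only for illustration):

```python
import ipaddress

# PHP's ip2long("192.0.2.1") equivalent:
ip_int = int(ipaddress.ip_address("192.0.2.1"))
print(ip_int)                              # 3221225985 (192*2**24 + 2*2**8 + 1)
back = str(ipaddress.ip_address(ip_int))   # round-trips to "192.0.2.1"
```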

Thanks again for the support!

(Jörg Prante) #6

You can read in the documentation that there is an error rate of around 5% in the approximation.

I do not recommend approximative cardinality aggregations as a method to control financial transactions.

(Adrien Grand) #7

How do you deduce the field cardinality from the second query? It seems to me that you are summing up unique counts per bucket, which is incorrect, since some IP addresses might fall into several hourly buckets and would then be counted several times.

(Herick) #8

Hi @jpountz

I'm not sure I understand what you mean by "deduce the field cardinality". Sorry, I'm a bit new to Elasticsearch :smile:

Yes, I'm summing up the unique counts per bucket in the second query. I wish I knew the proper approach if I'm doing something wrong. According to @jprante it seems right; I just wanted the counts to match across both queries! Or at least to find another way to count uniques.

Every time our system gets an ad impression we write a document to ES, and we have a stats page where users can see the number of impressions per supplier_id. I also need to show a unique count based on unique ad hits. With MySQL I simply use a DISTINCT on the ip_address field, but now I have run into this problem with ES!

(Adrien Grand) #9

@hmpmarketing Imagine you have two documents:

{ "@timestamp": "2015-05-11T00:00:00.000Z", "ip_address": "" }
{ "@timestamp": "2015-05-11T01:00:00.000Z", "ip_address": "" }

If you run a cardinality aggregation on all documents, you will get a count of 1, because both documents have the same IP address.

However, if you build an hourly date histogram and run a cardinality aggregation on each bucket, each bucket will have a unique count of 1, because the IP address appears once in each bucket. So the sum of the cardinalities over the buckets would be 2.
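
The effect is easy to reproduce with exact counting in plain Python (the IPs here are illustrative RFC 5737 documentation addresses, not values from the thread):

```python
from collections import defaultdict

# Two hourly buckets, one repeated address
docs = [
    {"hour": 0, "ip": "192.0.2.1"},
    {"hour": 1, "ip": "192.0.2.1"},   # same visitor, next hour
    {"hour": 1, "ip": "192.0.2.2"},
]

# Cardinality over the whole result set
overall_unique = len({d["ip"] for d in docs})

# Cardinality per hourly bucket, then summed
buckets = defaultdict(set)
for d in docs:
    buckets[d["hour"]].add(d["ip"])
summed_unique = sum(len(ips) for ips in buckets.values())

print(overall_unique, summed_unique)   # 2 3 -- the sum overcounts the repeat
```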

(Herick) #10

@jpountz I forgot to mention, I think: the required unique count is based on a 24h period. My objective is to show unique ad hits within a 24h period per supplier id. So, as per your example, the count of 1 would be the right one for me.

(Adrien Grand) #11

But that is not what your second query does. Your second query builds one bucket per hour and then computes a unique count inside each hourly bucket.

(Herick) #12

@jpountz yes. But my reason for posting is that the two queries' unique counts don't match when querying between the two timestamps (1430953200000 and 1431039599999). The first query returns 112019 and the date histogram returns 164322 :frowning:

(Adrien Grand) #13

There is no way to know the number of unique IP addresses between 1430953200000 and 1431039599999 from the result of the second query. It is wrong to assume that the sum of the unique counts of IP addresses for each hourly bucket equals the unique count of IP addresses for the whole range. See my previous example.

(Herick) #14

Hi @jpountz Good morning (UK here)!

Thanks for your kind reply. I suppose I am expecting something that isn't really possible with Elasticsearch.

Do you suggest any other approach to counting unique documents, or am I basically out of luck?

(Adrien Grand) #15

Actually, I'm still not exactly sure what you are after; could you try to explain what you would like to compute with this aggregation? Do I have it right that you would like to get counts only for new IP addresses that have not been seen before?

If this is the case, you could work it out by using a date_range aggregation and building your buckets manually. I tried to build an example:

DELETE test

PUT test
{
  "mappings": {
    "test": {
      "properties": {
        "@timestamp": {
          "type": "date"
        },
        "ip_address": {
          "type": "ip"
        }
      }
    }
  }
}

PUT test/test/1
{
  "@timestamp": "2015-05-11T11:05:00",
  "ip_address": ""
}

PUT test/test/2
{
  "@timestamp": "2015-05-11T12:35:00",
  "ip_address": ""
}

PUT test/test/3
{
  "@timestamp": "2015-05-11T13:02:00",
  "ip_address": ""
}

GET test/_search
{
  "aggs": {
    "by_hour": {
      "date_range": {
        "field": "@timestamp",
        "ranges": [
          {
            "from": "2015-05-11T11:00:00",
            "to": "2015-05-11T12:00:00"
          },
          {
            "from": "2015-05-11T11:00:00",
            "to": "2015-05-11T13:00:00"
          },
          {
            "from": "2015-05-11T11:00:00",
            "to": "2015-05-11T14:00:00"
          }
        ]
      },
      "aggs": {
        "unique_ip_count": {
          "cardinality": {
            "field": "ip_address"
          }
        }
      }
    }
  }
}
And the response would tell you, for instance, that there were 2 unique IP addresses between 11AM and 1PM as well as between 11AM and 2PM, so you would know that no new IP addresses were seen between 1PM and 2PM.
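
Assuming the three documents above share addresses as described (the pasted IP values were lost; the RFC 5737 documentation addresses below are my own illustration), the same cumulative-range arithmetic can be checked in plain Python:

```python
# Each doc: (ISO timestamp, ip). ISO strings compare chronologically.
docs = [
    ("2015-05-11T11:05:00", "192.0.2.1"),
    ("2015-05-11T12:35:00", "192.0.2.2"),
    ("2015-05-11T13:02:00", "192.0.2.1"),   # repeat of the first address
]

def uniques_before(end):
    """Unique IPs from 11:00 up to `end` -- one cumulative date_range bucket."""
    return len({ip for ts, ip in docs if ts < end})

cum_12 = uniques_before("2015-05-11T12:00:00")   # 1
cum_13 = uniques_before("2015-05-11T13:00:00")   # 2
cum_14 = uniques_before("2015-05-11T14:00:00")   # 2
new_13_to_14 = cum_14 - cum_13                   # 0: no new addresses 1PM-2PM
```

The difference between consecutive cumulative buckets gives the number of previously unseen addresses per hour, which a plain hourly histogram cannot provide.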

(Herick) #16

Hi @jpountz! Again, I appreciate your reply. Sorry, I was in transit yesterday and could not reply.

Let me try to explain my scenario and what I'm after:

We are an ad server. For every ad hit that comes to our ad servers, we create a document in ES that looks like this:

My current project is to switch our statistics page to ES. On our stats page we display information pertaining to each campaign_id where users can check out info about a campaign, the most important metric being impressions and uniques by campaign_id.

At the moment we have a rather cumbersome system for counting impressions/uniques: we parse nginx logs instead of doing a straight INSERT into MySQL the moment the ad hit happens, since we deal with a lot of traffic. Our current setup parses the info from the logs into the database, where I simply use DISTINCT() on the ip_address field to count unique hits between dates the user inputs. A typical query looks something like:

SELECT campaign_id, COUNT(*) AS impressions, COUNT(DISTINCT ip_address) AS uniques FROM table WHERE timestamp BETWEEN X AND Y GROUP BY campaign_id

The plan now is to send all this data to ES instead and query it there, calculating impressions as well as an approximate number of unique users who have hit our ads between two timestamps, based on their IP address, since that is probably the most accurate way to identify a single visitor.

I gave two examples in my first post because I will use the date histogram to generate an hourly chart, as well as a total unique count per campaign when the user needs it.

I hope I have explained it properly!

Thanks so much again!

(Herick) #17

I forgot to mention: the queries will always cover whole 24h periods, so 2 days = 48h, 3 days = 72h, and so on.

(Adrien Grand) #18

Thanks for the explanation. The hourly chart makes sense; the only thing to beware of is that the daily cardinality is not the sum of the hourly cardinalities, since some IP addresses may appear in several hourly buckets.
