API Search gives me a wrong bucket count


#1
Elasticsearch version: 1.4.4

OS version: RHEL 7.2

Description of the problem including expected versus actual behavior:

Hi,

I use Elasticsearch to collect log messages. I am trying to get an overview of
all hosts and the number of log entries per host in the last hour. To get the result I
use the Elasticsearch Python client and this query:

{
    "aggs" : {
        "hosts" : {
            "filter" : {
                "range" : {
                    "@timestamp" : { "gt" : "now-1h" }
                }
            },
            "aggs" : {
                "logs_per_host" : {
                    "terms" : {
                        "field" : "logsource",
                        "size" : 5000
                    }
                }
            }
        }
    },
    "size" : 0
}
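For reference, a minimal sketch of how the buckets of this query could be read in Python. The helper name `counts_per_host` and the sample response are hypothetical and the numbers are invented; with the Python client the real response would come from a call such as `es.search(index=..., body=query)`.

```python
# The query body from above, written as a Python dict.
query = {
    "size": 0,
    "aggs": {
        "hosts": {
            "filter": {"range": {"@timestamp": {"gt": "now-1h"}}},
            "aggs": {
                "logs_per_host": {
                    "terms": {"field": "logsource", "size": 5000}
                }
            },
        }
    },
}


def counts_per_host(response):
    """Map each bucket key (host name) to its doc_count (hypothetical helper)."""
    buckets = response["aggregations"]["hosts"]["logs_per_host"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hand-written sample shaped like a search response; not real cluster output.
sample_response = {
    "aggregations": {
        "hosts": {
            "doc_count": 165,
            "logs_per_host": {
                "buckets": [
                    {"key": "host-a", "doc_count": 120},
                    {"key": "host-b", "doc_count": 45},
                ]
            },
        }
    }
}

print(counts_per_host(sample_response))
```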

The field "logsource" contains the unique hostname of each server.
The query runs fine and I get buckets with the doc_count of each host.
The problem is that the count for some hosts seems to be wrong. For one host, the query
counts ~8000 logs in the last hour. If I check the value for this
host in Kibana, the count is ~4500 logs. I also verified
the count for this host with this ES query:

{
    "aggs" : {
        "host" : {
            "filter" : { "term" : { "logsource" : hostname } },
            "aggs" : {
                "logs_per_hour" : {
                    "date_histogram" : {
                        "field" : "@timestamp",
                        "interval" : "1h",
                        "order" : { "_count" : "asc" }
                    }
                }
            }
        }
    }
}

This shows me that the host has ~4000 logs per hour, so the count from the first query
seems to be wrong. I don't know whether this is a bug or the query is wrong...
Some counts from the first query seem to be okay because the values
match Kibana and the second query.

clintongormley told me on GitHub:
Hi @xoxys

You're using a top-level filter in the first query which is applied AFTER aggs are calculated.

But I don't know what this means. Can someone explain this in a bit more detail?
Thanks


(Zachary Tong) #2

You can read more about the post_filter here:

Basically, an aggregation gets the set of documents to aggregate from the "query" clause, so any document matching the query will be aggregated. The filtering done by a post_filter happens after the query, meaning the aggregation results are not affected by the post_filter.
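To illustrate the ordering described above, here is a small pure-Python model. It is not Elasticsearch code: the documents, the `search` function, and the `age_minutes` field are made up for illustration, and the real engine is considerably more involved.

```python
from collections import Counter

# Toy documents; all values are invented for illustration.
docs = [
    {"logsource": "hostname1", "age_minutes": 10},
    {"logsource": "hostname1", "age_minutes": 90},
    {"logsource": "hostname2", "age_minutes": 20},
    {"logsource": "hostname1", "age_minutes": 300},
]


def search(docs, query=None, post_filter=None):
    """Toy model of the phases: query -> aggregations -> post_filter -> hits."""
    matched = [d for d in docs if query is None or query(d)]
    # Aggregations see every document matched by the query.
    buckets = Counter(d["logsource"] for d in matched)
    # The post_filter narrows only the returned hits, after aggregating.
    hits = [d for d in matched if post_filter is None or post_filter(d)]
    return hits, dict(buckets)


def last_hour(doc):
    return doc["age_minutes"] < 60


# Time restriction as post_filter: buckets still count all documents.
hits_a, buckets_a = search(docs, post_filter=last_hour)
# Time restriction as query: buckets count only the last hour.
hits_b, buckets_b = search(docs, query=last_hour)

print(len(hits_a), buckets_a)
print(len(hits_b), buckets_b)
```

In this toy model both calls return the same two hits, but only the second one restricts the bucket counts to the last hour.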

To your first question: Terms aggregations can be approximate, depending on the sizes set. You can read more about it here: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts
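A rough sketch of why the counts can be approximate, written as a pure-Python toy model rather than Elasticsearch code. It assumes each shard reports only its top `shard_size` terms and the coordinating node sums what it receives; the function `sharded_terms` and the shard contents are invented for illustration.

```python
from collections import Counter


def sharded_terms(shards, size, shard_size):
    """Toy model: every shard reports only its top `shard_size` terms,
    then the coordinating node sums those counts and keeps `size` terms."""
    merged = Counter()
    for shard in shards:
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Invented per-shard values of the aggregated field.
shard_1 = ["a"] * 5 + ["b"] * 4 + ["c"] * 3
shard_2 = ["a"] * 5 + ["c"] * 5 + ["b"] * 3

exact = Counter(shard_1 + shard_2)
approx = sharded_terms([shard_1, shard_2], size=2, shard_size=2)

print(exact)
print(approx)
```

In this model term "c" really occurs 8 times, but shard 1 never reports it, so the merged result shows only 5. A count that is missing a shard's contribution like this comes out too low, not too high.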

It's not clear to me if this is your problem. Perhaps you could show some of the results that look wrong?


#3

Sorry for the bad explanation. What I am trying to get is an overview of all hosts and the count of log entries per host for the last hour.

The buckets actually look like this:
Got 1076571 Hits
{ u'buckets': [ { u'doc_count': 8637, u'key': u'hostname1'},
                { u'doc_count': 4024, u'key': u'hostname2'},

This looks fine, but the problem is that the count for hostname1 is not correct. Kibana shows 3800 for this host in the last hour. The count for hostname2, however, is the same in Elasticsearch and Kibana. So there seems to be no general problem, but some hosts do not match...

If I try this query:
"aggs" : {
    "host" : {
        "filter" : { "term" : { "logsource" : "hostname1" } },
        "aggs" : {
            "logs_per_hour" : {
                "date_histogram" : {
                    "field" : "@timestamp",
                    "interval" : "1h",
                    "order" : { "_count" : "asc" }
                }
            }
        }
    }
}
I get the same count as Kibana (3800), so I think this count is the right one. The question is why I get ~8700 with the first query, and why some hosts match Kibana and some do not.
Thank you again
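One possible cross-check, sketched below, is to apply the same `now-1h` range and the same term filter in a single request, so that both numbers cover the same documents. It only nests the two filter aggregations already used in this thread; the function names and the sample response are hypothetical.

```python
def count_query_for_host(hostname):
    """Build a body that counts one host's documents in the last hour
    (hypothetical helper; nests the two filters used in the queries above)."""
    return {
        "size": 0,
        "aggs": {
            "last_hour": {
                "filter": {"range": {"@timestamp": {"gt": "now-1h"}}},
                "aggs": {
                    "host": {"filter": {"term": {"logsource": hostname}}}
                },
            }
        },
    }


def host_count(response):
    """Read the doc_count of the inner filter aggregation."""
    return response["aggregations"]["last_hour"]["host"]["doc_count"]


body = count_query_for_host("hostname1")

# Hand-written sample shaped like a search response; the numbers are invented.
sample_response = {
    "aggregations": {"last_hour": {"doc_count": 500, "host": {"doc_count": 42}}}
}

print(host_count(sample_response))
```

If the number returned for a host differs from its bucket in the terms aggregation over the same window, that would narrow the problem down to how the terms are bucketed rather than to the time range.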

