Elastic Query Returning Results With doc_count Lower Than Expected


(Cody Burke) #1

Hi, so I am running a query to count the number of API calls in my logs.
When I use this query, the counts come back too low:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "analyze_wildcard": true,
            "query": "*"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1505102400000,
              "lte": 1505188799999,
              "format": "epoch_millis"
            }
          }
        }
      ],
      "must_not": []
    }
  },
  "size": 0,
  "_source": {
    "excludes": []
  },
  "aggs": {
    "2": {
      "terms": {
        "field": "api_call.keyword",
        "size": 150,
        "order": {
          "_count": "desc"
        }
      },
      "aggs": {
        "3": {
          "terms": {
            "field": "response.keyword",
            "size": 10,
            "order": {
              "_count": "desc"
            }
          }
        }
      }
    }
  }
}

As an example, when I run this, my top endpoint comes back as having 582,734 hits in a day.
When I search my logs for that endpoint, I get 597,076 hits.

The logs consistently show more hits than Elasticsearch for a given endpoint - even the percentage difference between Elasticsearch and the logs stays the same.

At this point you are probably thinking that there is a problem with how I am getting data into Elasticsearch, and that the missing documents simply are not there. However, when I go into Kibana and filter for an endpoint, the number I get is exactly the same as what I see in my logs. In addition, when I run the query, the total number of hits matches the number of hits in my logs exactly.
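
For reference, the per-endpoint check I am doing looks roughly like this (a sketch - the logs-* index pattern and the match on the analyzed api_call field are assumptions on my part; the timestamps are the same range as in the aggregation query above):

GET logs-*/_count
{
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "api_call": "guest/accountInformation"
          }
        },
        {
          "range": {
            "@timestamp": {
              "gte": 1505102400000,
              "lte": 1505188799999,
              "format": "epoch_millis"
            }
          }
        }
      ]
    }
  }
}

This count agrees with the logs, while the terms aggregation does not.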

The query's response also has

"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,

appear before each result, so it does not seem like the issue discussed in https://www.elastic.co/guide/en/elasticsearch/reference/5.4/search-aggregations-bucket-terms-aggregation.html#search-aggregations-bucket-terms-aggregation-approximate-counts is the cause.

Does anyone know why I am seeing this discrepancy?


(Mark Harwood) #3

My guess is it's some tokenization discrepancy between the value used in the search request and the aggregatable value (normalization, perhaps?).
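
One quick way to check that is the _analyze API - a sketch, assuming the search field uses the standard analyzer (substitute whatever analyzer your mapping actually applies):

POST _analyze
{
  "analyzer": "standard",
  "text": "/guest/accountInformation"
}

With the standard analyzer the value is split into lowercased tokens (guest, accountinformation), so a full-text search can match several raw variants of the string, while the .keyword sub-field aggregates each raw variant as a separate bucket.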

Can you supply a reproducible example of any kind?


(Cody Burke) #4

Hi Mark,

So I think I figured it out.

My query returns

{
  "took": 412,
  "timed_out": false,
  "num_reduce_phases": 4,
  "_shards": {
    "total": 1895,
    "successful": 1895,
    "failed": 0
  },
  "hits": {
    "total": 2120635,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "2": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "3": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "200",
                "doc_count": 627050
              },
              {
                "key": "401",
                "doc_count": 31889
              },
              {
                "key": "400",
                "doc_count": 3539
              }
            ]
          },
          "key": "guest/accountInformation",
          "doc_count": 662478
        },
        {
          "3": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
              {
                "key": "200",
                "doc_count": 222928
              },
              {
                "key": "401",
                "doc_count": 23415
              },
              {
                "key": "400",
                "doc_count": 148
              }
            ]
          },
          "key": "message/myMessages",
          "doc_count": 246491
        }, ...

(there is more, but I don't think people would appreciate reading a 1000+ line query result)

I noticed that when I filter in Kibana Discover on "guest/accountInformation" I get two results:

guest/accountInformation and /guest/accountInformation, with 662,478 and 23,586 hits respectively.
Looking at other endpoints, I see that each one has a few instances with this leading '/'.

These two numbers together add up to the total that I see in my logs. So I guess my question now changes to "Is there an easy way to remove leading /'s in existing documents?"
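
For anyone landing here later: one way to fix the existing documents is _update_by_query with a small Painless script. This is a sketch, not a drop-in fix - the logs-* index pattern is an assumption, as is that api_call is the source field backing api_call.keyword (on 5.x the script body uses "inline"; newer versions use "source"):

POST logs-*/_update_by_query
{
  "query": {
    "prefix": {
      "api_call.keyword": "/"
    }
  },
  "script": {
    "lang": "painless",
    "inline": "ctx._source.api_call = ctx._source.api_call.substring(1)"
  }
}

That only repairs what is already indexed; fixing the producer (or stripping the slash at ingest time, e.g. with a Logstash mutate gsub filter) is what keeps new documents from reintroducing the variant.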


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.