Missing aggregations field in response

Hi,

We ran into an error where a query to Elasticsearch that included aggregations didn't return the aggregations field in the response at all. For a query like this:

{
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {
                    "range": {
                        "target_status_code": {
                            "gte": 400
                        }
                    }
                },
                {
                    "range": {
                        "timestamp": {
                            "gte": "2019-07-29T03:47:01",
                            "lte": "2019-07-29T03:49:01"
                        }
                    }
                }
            ]
        }
    },
    "aggs": {
        "key": {
            "terms": {
                "field": "request_key"
            }
        }
    }
}

We got a response like the following:

{
    "took": 203,
    "timed_out": false,
    "_shards": {
        "total": 255,
        "successful": 254,
        "skipped": 254,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": 0,
        "hits": []
    }
}

Note that there's no aggregations field in the response. What is odd is that this only happened twice. The query runs periodically, and the response is usually something like this:

{
  "took" : 158,
  "timed_out" : false,
  "_shards" : {
    "total" : 255,
    "successful" : 255,
    "skipped" : 254,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "key" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [ ]
    }
  }
}

So even though there's no data, the field exists in the response (which is something that the code using it relies on). I couldn't find any reference to behavior like this in the ES documentation, and it seems more like a bug. But I'm trying to understand whether this is expected behavior that the code should be able to handle.
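Until that's settled, the consuming code could read the field defensively. Here's a minimal Python sketch of that idea, assuming the response has already been parsed into a dict; `terms_buckets` is a hypothetical helper name, not part of any Elasticsearch client:

```python
def terms_buckets(response, agg_name):
    """Return the buckets of a terms aggregation, or [] if it is absent.

    `response` is assumed to be the parsed JSON body of a search response.
    """
    # Treat a missing (or null) "aggregations" field like an empty result.
    aggs = response.get("aggregations") or {}
    return aggs.get(agg_name, {}).get("buckets", [])
```

With this, both responses shown above would yield an empty list for `terms_buckets(response, "key")`.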

We're on version 6.7.2 of ES. Let me know if there's more context I can provide.

Hmm, I think the reason is the shard responses:

"total": 255,
"successful": 254,
"skipped": 254,

vs

"total": 255,
"successful": 255, //<-- this
"skipped": 254,

I.e., the second response has one more "successful" shard than the incorrect response.
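If you want to spot these responses on the client side, one option is to check whether the shard counts add up. This is a rough Python sketch, assuming skipped shards are counted within "successful" (which the normal response above suggests, since 255 successful includes 254 skipped); `unaccounted_shards` is a made-up helper name:

```python
def unaccounted_shards(response):
    """Return how many shards are neither successful nor failed.

    Assumption: skipped shards are already included in "successful",
    so total should equal successful + failed in a healthy response.
    """
    shards = response["_shards"]
    return shards["total"] - shards["successful"] - shards["failed"]
```

For the two responses in this thread, this returns 1 for the one missing aggregations and 0 for the usual one.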

We have an optimization which skips entire shards if they can't possibly answer the query/aggregation, which is often leveraged by time-based data with ranges. It's sort of a pre-execution phase called "can-match".

What I think is happening is that the first response gets a "can-match: no" response from all shards and they all get skipped... and no response is generated. The second response has a single shard which doesn't get skipped and so a response is generated... although it happens to be empty.
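To make that hypothesis concrete, here's a toy Python model. It is only an illustration of the idea, not Elasticsearch's actual code: the `Shard` class, `can_match`, and `search` are all invented for this sketch, and real shards track far more than a timestamp range.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    min_ts: str  # earliest timestamp held by the shard (ISO 8601)
    max_ts: str  # latest timestamp held by the shard (ISO 8601)


def can_match(shard, gte, lte):
    # Same-format ISO 8601 strings compare correctly as plain strings.
    # A shard can only match if its range overlaps the query range.
    return shard.min_ts <= lte and shard.max_ts >= gte


def search(shards, gte, lte):
    executed = [s for s in shards if can_match(s, gte, lte)]
    response = {
        "_shards": {"total": len(shards), "skipped": len(shards) - len(executed)}
    }
    # Hypothesized bug: with no executed shard, nothing builds the
    # (empty) aggregation result, so the field is missing entirely.
    if executed:
        response["aggregations"] = {"key": {"buckets": []}}
    return response
```

In this model, a query whose range overlaps no shard returns a response without "aggregations", while a single overlapping shard is enough to get the empty "buckets" list back.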

I'm not sure how this is happening. We have logic in there to make sure at least one shard response is generated, precisely because we have to generate an aggregation response even if the hits are empty. This feels like a bug to me. Would you mind filing a ticket? You can /cc me on it and I'll take a look next week.

Some questions:

  • Do you use a timeout on your searches?
  • Was this going through cross-cluster-search?
  • Anything notable in the logs, like an error or a shard being unavailable or something?
  • You said it only happened twice, so I'm assuming you don't have a more minimal example that we can use to replicate it? Figured it couldn't hurt to ask :slight_smile:

Thanks!

Sorry for the delay in responding - here's the issue on GitHub :slight_smile: https://github.com/elastic/elasticsearch/issues/45559