Aggregation fails silently when run on huge data: instead of an error, it returns the default response of 10 docs and hits

Stack details -

ES version - 7.10.1
Kibana version - 7.10.1
Java high level client version - 7.2.0

I have a production cluster of 4 nodes where a huge amount of data is stored in one of the indices (approx. 30M+ docs, each doc has 10 fields). In one of our apps we run an aggregation query on this data that fetches a lot of buckets in one go.

Usually the number of docs that match the query ranges between 300K and 1M, and in that case the aggregation works fine. But when the number of docs matching the query goes beyond 20M, the aggregation query fails silently. It does not give any error message (previously we used to get a max_bucket exception, then we raised the limit to 100K and it has not been observed since). Instead it returns the default response, as if I had searched with a query like - GET /{{index_name}}/_search

This is causing a problem in our application: if it returned an error, the application could do something about it, but it just returns some other response.

I want to know in which situations ES will return a default response like that instead of an error.

Following is some more info (dummy) on the index and the kind of query we run -
index name - some_index

query -

GET /some_index/_search
{
  "size": 0, 
  "query": {
    "bool": {
      "must": [
        {
          "term": {
            "id_1": {
              "value": "1258"
            }
          }
        },
        {
          "terms": {
            "status": [
              "status1",
              "status2"
            ]
          }
        }
      ]
    }
  },
  "aggs": {
    "topLevelField": {
      "terms": {
        "field": "string_field",
        "size": 100000
      },
      "aggs": {
        "groupingBy": {
          "terms": {
            "field": "string_id_field",
            "size": 100000
          },
          "aggs": {
            "topHitsDocs": {
              "top_hits": {
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}

Here the cardinality of the top-level agg field (string_field) is around 40K in the worst case, and that of the inner aggregation field string_id_field is around 30K in the worst case. In one more nested aggregation we fetch the top hit doc from each bucket.
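For a rough sense of scale, the worst-case figures above can be multiplied out. This is only back-of-the-envelope arithmetic, not a description of how ES counts buckets internally. It assumes both fields are single-valued, so a document can fall into at most one inner bucket; the variable and function names are illustrative.

```python
# Worst-case figures quoted in this post (the poster's own estimates).
top_level_cardinality = 40_000   # distinct values of string_field
inner_cardinality = 30_000       # distinct values of string_id_field
matching_docs = 20_000_000       # docs matching the query in the failing case
raised_bucket_limit = 100_000    # the max-bucket limit mentioned in the post


def worst_case_inner_buckets(outer: int, inner: int, docs: int) -> int:
    """Upper bound on non-empty (outer, inner) buckets.

    Assumption: single-valued fields, so every non-empty bucket needs at
    least one distinct document and the count is capped by the doc count.
    """
    return min(outer * inner, docs)


combinations = top_level_cardinality * inner_cardinality
upper_bound = worst_case_inner_buckets(
    top_level_cardinality, inner_cardinality, matching_docs
)

print(combinations)                       # 1200000000 possible pairs
print(upper_bound)                        # 20000000 non-empty buckets at most
print(upper_bound > raised_bucket_limit)  # True
```

Under these assumptions the inner level alone could need up to 20M buckets, each carrying one top hit, which is far above the 100K limit mentioned earlier.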

So I am not able to understand this: if the query becomes too heavy for ES, it should return an error or something. Instead it seems to return a default response, which makes things harder, and I cannot understand why it would do this. Any insights or help would be appreciated. Thanks.

Can you share an example of the response which you see as wrong? (Full JSON but you can leave out fields of docs)

It's the default response I would get if I issued a query like - GET /index_name/_search

This is the sample response -

{
  "took" : 89,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "some_index",
        "_type" : "_doc",
        "_id" : "de36e7ca-0cea-49b4-9174-a570b21fd3c7",
        "_score" : 1.0,
        "_source" : {
          "field1" : {
            "id" : 10027151
          },
          "field2" : {
            "id" : "147"
          },
          "field3" : "3884979a-6697-4194-8b07-4aad9250bd94",
          "field4" : null,
          "field5" : {
            "id" : "11797975"
          },
          "field6" : false,
          "field7" : {
            "id" : "20"
          },
          "field8" : "non",
          "field9" : true,
          "field10" : "Success",
          "field11" : null,
          "field12" : 1620940454000,
          "field13" : 3,
          "field14" : "abc",
          "field15" : {
            "id" : "57876"
          },
          "field16" : "abcd"
        }
      },
      ....
      .......
      .........9 more such docs
    ]
  }
}

Thanks. That response doesn’t seem to tie up with your example query though, which had size=0, meaning no hits are expected in the response.
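The mismatch described here (size=0 and aggs in the request, but hits and no aggregations in the response) is something an application could check for itself. Below is a minimal sketch of such a guard, assuming the request and response bodies are available as parsed JSON dicts; the function name and the sample data are hypothetical. It does not explain why the mismatch happens, it only detects it.

```python
def find_mismatches(request_body: dict, response_body: dict) -> list:
    """Return a list of ways the response disagrees with the request."""
    problems = []

    # A request with "size": 0 should come back with an empty hits array.
    returned_hits = response_body.get("hits", {}).get("hits", [])
    if request_body.get("size") == 0 and returned_hits:
        problems.append(f"size was 0 but {len(returned_hits)} hit(s) came back")

    # A request containing aggs should come back with an "aggregations" key.
    wants_aggs = "aggs" in request_body or "aggregations" in request_body
    if wants_aggs and "aggregations" not in response_body:
        problems.append("aggregations were requested but none were returned")

    return problems


# Hypothetical data shaped like the request and response in this thread.
sample_request = {
    "size": 0,
    "aggs": {"topLevelField": {"terms": {"field": "string_field"}}},
}
sample_response = {
    "hits": {
        "total": {"value": 10000, "relation": "gte"},
        "hits": [{"_index": "some_index", "_id": "placeholder-id"}],
    }
}

for problem in find_mismatches(sample_request, sample_response):
    print(problem)
```

An application could treat a non-empty result from such a check as an error condition instead of trying to parse the unexpected response.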

Yes, that is the weird part. I expected an error or some other behaviour, but the response I got does not correspond to the query I sent, which surprised me. Any idea why this might happen?

I'd need these 3 things to comment further:

  1. The JSON of the request
  2. The JSON of the response to 1)
  3. What you were expecting to see in the response but didn't
