ES 2.4 to 5.2 Upgrade Followed By Major Cluster Instability

@mstruve - any script usage on that cluster (Painless)?

@Itamar_Syn_Hershko We use scripts in two places:

  1. One expression script, in a single call when we compile reports nightly. We use it to calculate the time between two fields and aggregate on that.
  2. A single Painless script, used on occasion to let users update some fields via update_by_query. It is used maybe once or twice a week, and there is a good chance nobody has used it on the new cluster yet.

Please start your own thread for this :slight_smile:

OK folks, it looks like we have found a solution for our stability issue.

After we sent ES a heap dump from one of our spikes, they were able to find a bug that affects filters in aggregations. In filter aggregations like the one below, the bug causes the filter to be run as a bool query, so it is scored. This is a big issue for us because we make A LOT of these.

{
  "agg_name":{
    "filter":{
      "terms":{
        "id":[1,2,...]
      }
    }
  }
}

The temporary fix is to wrap the filter in constant_score, like this, so that the filter is not scored.

{
  "agg_name": {
    "filter": {
      "constant_score": {
        "filter": {
          "terms": {
            "id": []
          }
        }
      }
    }
  }
}
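For anyone who builds a lot of these aggregations in application code, here is a minimal sketch of applying the same workaround programmatically. It assumes the aggregations are assembled as plain Python dicts before being sent to Elasticsearch; the helper name `wrap_filter_aggs` and the sample ids are hypothetical and not from our codebase. It only handles the single `filter` aggregation shown above, not the multi-bucket `filters` aggregation.

```python
import copy


def wrap_filter_aggs(aggs):
    """Return a copy of an aggs dict in which every `filter` aggregation
    has its query wrapped in constant_score (hypothetical helper)."""
    wrapped = copy.deepcopy(aggs)
    for body in wrapped.values():
        query = body.get("filter")
        # Wrap only once, so calling the helper twice is harmless.
        if isinstance(query, dict) and "constant_score" not in query:
            body["filter"] = {"constant_score": {"filter": query}}
        # Sub-aggregations can live under either key.
        for key in ("aggs", "aggregations"):
            if isinstance(body.get(key), dict):
                body[key] = wrap_filter_aggs(body[key])
    return wrapped


# Illustrative input; the ids are made up.
original = {"agg_name": {"filter": {"terms": {"id": [1, 2]}}}}
fixed = wrap_filter_aggs(original)
```

The helper returns a new dict and leaves the input untouched, so it can be dropped in just before the request is sent and removed again once the underlying bug is fixed.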

That change looks to have completely stabilized us. The change went out at noon, and the graphs since then show a healthy sawtooth, unlike before the change. We will be monitoring it closely and will repost if anything changes. Thanks everyone for the help!
