Cluster takes too long to apply cluster state

Hi guys,

We have some 1TB+ indices, and it takes more than a minute to drop them when we rotate them. During the deletion the cluster takes too long to apply the cluster state, and the master nodes start kicking data nodes out of the cluster, ending up in a RED state with some primary shards disconnected.

We can tolerate the delete operation taking a long time and some queries being queued, but we want to avoid nodes being kicked out of the cluster.

Which settings can we tweak so that nodes are not kicked out when applying the cluster state takes too long and the nodes are busy deleting the data?

What do you mean by "drop"?

What is the output from the _cluster/stats?pretty&human API?

Drop = delete, i.e. curl -X DELETE "{endpoint}/{index}"

{
  "_nodes" : {
    "total" : 23,
    "successful" : 23,
    "failed" : 0
  },
  "cluster_name" : "es_cluster",
  "cluster_uuid" : "0PWha9VWQBCLhgBwryfg2Q",
  "timestamp" : 1679682126440,
  "status" : "green",
  "indices" : {
    "count" : 10,
    "shards" : {
      "total" : 196,
      "primaries" : 53,
      "replication" : 2.69811320754717,
      "index" : {
        "shards" : {
          "min" : 2,
          "max" : 120,
          "avg" : 19.6
        },
        "primaries" : {
          "min" : 1,
          "max" : 30,
          "avg" : 5.3
        },
        "replication" : {
          "min" : 1.0,
          "max" : 3.0,
          "avg" : 1.4
        }
      }
    },
    "docs" : {
      "count" : 134566730,
      "deleted" : 50773306
    },
    "store" : {
      "size" : "1.5tb",
      "size_in_bytes" : 1731111345325,
      "total_data_set_size" : "1.5tb",
      "total_data_set_size_in_bytes" : 1731111345325,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "6.8gb",
      "memory_size_in_bytes" : 7406660464,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "1.2gb",
      "memory_size_in_bytes" : 1378531580,
      "total_count" : 39295360141,
      "hit_count" : 6646100796,
      "miss_count" : 32649259345,
      "cache_size" : 1058237,
      "cache_count" : 78144276,
      "evictions" : 77086039
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 5186,
      "memory" : "190.4mb",
      "memory_in_bytes" : 199650108,
      "terms_memory" : "118.7mb",
      "terms_memory_in_bytes" : 124494296,
      "stored_fields_memory" : "6.6mb",
      "stored_fields_memory_in_bytes" : 6952672,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "12.1mb",
      "norms_memory_in_bytes" : 12784256,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "52.8mb",
      "doc_values_memory_in_bytes" : 55418884,
      "index_writer_memory" : "712mb",
      "index_writer_memory_in_bytes" : 746672836,
      "version_map_memory" : "10mb",
      "version_map_memory_in_bytes" : 10505100,
      "fixed_bit_set" : "177.1mb",
      "fixed_bit_set_memory_in_bytes" : 185788680,
      "max_unsafe_auto_id_timestamp" : 1679098436506,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "boolean",
          "count" : 23,
          "index_count" : 7,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 6,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 77,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "double",
          "count" : 3,
          "index_count" : 1,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 27,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 202,
          "index_count" : 10,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 66,
          "index_count" : 7,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 11,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 106,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 137,
          "index_count" : 10,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [ ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [
        {
          "name" : "custom",
          "count" : 2,
          "index_count" : 2
        }
      ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [
        {
          "name" : "keyword",
          "count" : 2,
          "index_count" : 2
        }
      ],
      "built_in_filters" : [
        {
          "name" : "lowercase",
          "count" : 2,
          "index_count" : 2
        }
      ],
      "built_in_analyzers" : [
        {
          "name" : "standard",
          "count" : 2,
          "index_count" : 1
        }
      ]
    },
    "versions" : [
      {
        "version" : "7.5.0",
        "index_count" : 1,
        "primary_shard_count" : 1,
        "total_primary_size" : "52.5kb",
        "total_primary_bytes" : 53830
      },
      {
        "version" : "7.10.0",
        "index_count" : 2,
        "primary_shard_count" : 16,
        "total_primary_size" : "2.6mb",
        "total_primary_bytes" : 2791344
      },
      {
        "version" : "7.17.7",
        "index_count" : 7,
        "primary_shard_count" : 36,
        "total_primary_size" : "403.2gb",
        "total_primary_bytes" : 432985967198
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 23,
      "coordinating_only" : 3,
      "data" : 15,
      "data_cold" : 15,
      "data_content" : 15,
      "data_frozen" : 15,
      "data_hot" : 15,
      "data_warm" : 15,
      "ingest" : 0,
      "master" : 5,
      "ml" : 0,
      "remote_cluster_client" : 0,
      "transform" : 15,
      "voting_only" : 0
    },
    "versions" : [
      "7.17.7"
    ],
    "os" : {
      "available_processors" : 540,
      "allocated_processors" : 540,
      "names" : [
        {
          "name" : "Linux",
          "count" : 23
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Red Hat Enterprise Linux",
          "count" : 23
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 23
        }
      ],
      "mem" : {
        "total" : "11.3tb",
        "total_in_bytes" : 12436988469248,
        "free" : "4tb",
        "free_in_bytes" : 4470949167104,
        "used" : "7.2tb",
        "used_in_bytes" : 7966039302144,
        "free_percent" : 36,
        "used_percent" : 64
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 439
      },
      "open_file_descriptors" : {
        "min" : 830,
        "max" : 1998,
        "avg" : 1249
      }
    },
    "jvm" : {
      "max_uptime" : "64.1d",
      "max_uptime_in_millis" : 5538397280,
      "versions" : [
        {
          "version" : "19",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "19+36-2238",
          "vm_vendor" : "Oracle Corporation",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 23
        }
      ],
      "mem" : {
        "heap_used" : "261.5gb",
        "heap_used_in_bytes" : 280788898664,
        "heap_max" : "690gb",
        "heap_max_in_bytes" : 740881858560
      },
      "threads" : 2720
    },
    "fs" : {
      "total" : "13.4tb",
      "total_in_bytes" : 14810401996800,
      "free" : "11.7tb",
      "free_in_bytes" : 12867983949824,
      "available" : "11.7tb",
      "available_in_bytes" : 12867983949824
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 23
      },
      "http_types" : {
        "security4" : 23
      }
    },
    "discovery_types" : {
      "zen" : 23
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "rpm",
        "count" : 23
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

Hmm, this sounds like a possible bug: we shouldn't be blocking cluster state updates on deletes, no matter how large the indices are. Could you follow the troubleshooting guide to work out what extra information to collect so we can dig deeper?

Hi David,

Thanks for the link. I'll work through the guide as far as we can afford to interrupt that environment.

Most "heath checkers" show that everything is within green range.

In the interim, can you recommend what we could tune from this list?

For example, how far can we safely bump cluster.publish.timeout? Is the cluster stuck until the cluster state is acknowledged by all nodes, or is the behaviour different?
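For reference, this is roughly the change we have in mind (just a sketch; as far as I understand cluster.publish.timeout is a static setting, so it would go into elasticsearch.yml on the master-eligible nodes, and the value below is only a placeholder, not a recommendation):

# elasticsearch.yml on the master-eligible nodes
cluster.publish.timeout: 90s    # default is 30s; placeholder value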

Without seeing the results of the troubleshooting it's not possible for me to say what, if anything, can be done to work around the problem.

Hi guys,

We had the same situation. We have 1TB+ indices, and every month we tried to remove them with DELETE /indexname_202301_*, after which 1-2 nodes could drop out of the cluster. Then we switched to deleting the indices one by one (roughly as in the sketch below). We were on 7.12 and thought this was fixed in the new version, 7.17.

The main reason is that deleting these big indices takes the cluster a long time and puts a heavy I/O load on the disk subsystem.
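Roughly what our rotation does now (just a sketch; the host, index pattern, and pause length are illustrative):

# delete matching indices one at a time, pausing so the cluster state can settle
for idx in $(curl -s 'http://localhost:9200/_cat/indices/indexname_202301_*?h=index'); do
  curl -s -X DELETE "http://localhost:9200/${idx}"
  sleep 60
done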

That doesn't really explain the problem though, at least not in enough detail for us to think about fixing it.

We are going to add security to our cluster, so we can reboot nodes or the whole cluster as we want and reproduce the issue. What should we enable in the config in order to log all the needed info?

The troubleshooting guide I linked above describes the steps to follow to capture what we need to understand the problem.

We use Elasticsearch 7.17.
Can we use the same loggers as in 8.6?

logger.org.elasticsearch.cluster.coordination.LagDetector: DEBUG
logger.org.elasticsearch.transport.TcpTransport: DEBUG
logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport: DEBUG

Yes those loggers are unchanged in 7.17.
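If it's easier than editing elasticsearch.yml, I believe you can also flip them at runtime via the cluster settings API and set them back to null afterwards, something like this (the host is a placeholder):

curl -X PUT "http://localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "persistent": {
    "logger.org.elasticsearch.cluster.coordination.LagDetector": "DEBUG",
    "logger.org.elasticsearch.transport.TcpTransport": "DEBUG",
    "logger.org.elasticsearch.xpack.core.security.transport.netty4.SecurityNetty4Transport": "DEBUG"
  }
}'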

Hi Denis,

Are there any issues with logging verbosity in DEBUG mode in your prod environments? I think I need to increase the level too in order to debug this issue.

I'll let you know in about a week; we are going to delete many big indices soon.

Hi Denis,

Any updates from your side by chance?

We finally have the logs.

Wow, OK, it looks like your disks are really slow here. To some extent you're also affected by these issues:

You can mitigate the first by setting thread_pool.fetch_shard_store.max: 1, and maybe that'll be enough to take the load off your disks. If it isn't, you'll need to improve your storage performance somehow.
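For clarity, that would look something like this (thread pool settings are static as far as I know, so it goes into elasticsearch.yml on the data nodes, followed by a rolling restart):

# elasticsearch.yml on each data node
thread_pool.fetch_shard_store.max: 1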

Thanks, we will try to apply thread_pool.fetch_shard_store.max: 1

Hi Denis,

Did applying this parameter help?