ILM slows down node recovery/rolling upgrade process

On my test cluster, I started a rolling upgrade. It's a 3-node cluster, so the cluster health went promptly to red. Once my upgraded node started up, I found that the index reallocation/recovery process was taking far longer than it should, as in minutes per shard.

I also noticed a lot of ILM related errors in the logs.

Putting two and two together, I stopped ILM and the recovery process started speeding up after a bit. I'm guessing it had to wait for the ILM jobs that were active to fully stop.
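(To be clear, "stopped ILM" here just means the ILM stop/start APIs, roughly:)

    # stop ILM and check that it actually reaches STOPPED
    POST _ilm/stop
    GET _ilm/status

    # start it again once the upgrade is finished
    POST _ilm/start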

Which leaves me wondering, why did ILM even try to do anything when the cluster was red? Even if that wasn't a sign to stop, I'd think finding that the shards needed for the job are unallocated or not recovered would trigger a quick short circuit. (I'm assuming the logs I'm seeing are related to the unallocated shards; I haven't actually confirmed that.) Why would it keep trying to do enough work that allocation would slow down so drastically?

For older versions of ES there were instructions about disabling shard reallocation to speed up recovery during a rolling upgrade, but those instructions seem to be gone. As far as I can tell, there are no special rolling upgrade instructions anymore.

So I don't think I'm missing anything, but am I?

Or is this just something that happens, and the solution is to remember to stop ILM before upgrading?

It sounds like you expected this, but that's not right: cluster health should be yellow or green during a rolling restart/upgrade, never red. You must have had some unassigned shards, or shards without replicas, to cause that.

Red health just means that at least one primary shard is unassigned. That's not a reason for ILM to stop, and indeed we want ILM to keep on trying even on the unhealthy index so that it can proceed ASAP when the index returns to health.

Manually stopping ILM shouldn't be necessary during a rolling upgrade. Could you share the output of GET _cluster/stats?pretty please?

There's still a recommended process to follow for a rolling upgrade (effectively the same as a rolling restart). But disabling ILM is not part of this process.
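The allocation-related part of that process looks roughly like this (check the docs for your version, since the recommended setting has changed over the years):

    # before taking a node down, restrict allocation to primaries only
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": "primaries"
      }
    }

    # once the upgraded node has rejoined, re-enable allocation
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": null
      }
    }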

Thanks to a lack of resources, I have ES configured to have only a single primary shard for each index. So that would explain the red status. (I originally tried to have only 1 node as well, but restarts on that got slow really fast due to needing to open more indexes than normal.)

Though I did think taking 1 node out of a 3-node cluster would make it go red as well.

By the time I'm back in the office, it'll have been a while. Is it possible to load the stats for a specific time period?

Ok, sheesh. I looked for that doc repeatedly over the past few versions and somehow missed it. So I should still stop reallocation between nodes.

By "single primary shard" do you mean number_of_replicas: 0? In which case are you sure a rolling upgrade makes sense? It's quite a complex sequence of steps designed to maintain availability throughout, but that requires replicas - if you are ok with a period of unavailability then it'd be a good deal simpler (and possibly quicker) to upgrade all the nodes at the same time.

No, but the current stats will still be useful to see.


Any place I can send the stats that isn't public? I'm feeling overly cautious today.

You can send me a DM if you'd prefer.

Er, you have DMs turned off.

Not sure what's wrong with my DMs but we connected eventually. I don't see anything particularly concerning unfortunately. A decent shard count but not ludicrously high, and a pretty recent version:

  "indices": {
    "count": 7626,
    "shards": {
      "total": 7700,
      "primaries": 7626,
...
    "versions": [
      "8.12.0"
    ],

I think if ILM really is slowing things down so much when the cluster has many unassigned primaries then we would consider that a bug. It's not a scenario that we cover in any benchmarks today AFAIK, so it wouldn't surprise me if there were a performance bug of this nature. Pretty tricky to know how to fix it though. Maybe collecting GET _nodes/_master/hot_threads?threads=9999 repeatedly while the cluster is recovering would help shed some light on the problem.

I looked at some hot threads captured during recovery (also shared via DM) but I didn't see anything obviously related to ILM in there. It caught one snapshot that's perhaps a little suspicious:

       java.base@21.0.2/java.lang.Throwable.<init>(Throwable.java:273)
       java.base@21.0.2/java.lang.Exception.<init>(Exception.java:67)
       java.base@21.0.2/java.lang.RuntimeException.<init>(RuntimeException.java:63)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.ElasticsearchException.<init>(ElasticsearchException.java:134)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.ResourceAlreadyExistsException.<init>(ResourceAlreadyExistsException.java:26)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.ResourceAlreadyExistsException.<init>(ResourceAlreadyExistsException.java:21)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.cluster.metadata.MetadataCreateIndexService.validateIndexName(MetadataCreateIndexService.java:177)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.cluster.metadata.MetadataCreateIndexService.validate(MetadataCreateIndexService.java:1370)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.cluster.metadata.MetadataCreateIndexService.applyCreateIndexRequest(MetadataCreateIndexService.java:345)
       org.elasticsearch.xpack.downsample.TransportDownsampleAction$3.execute(TransportDownsampleAction.java:800)
       org.elasticsearch.xpack.downsample.TransportDownsampleAction$1.executeTask(TransportDownsampleAction.java:136)
       org.elasticsearch.xpack.downsample.TransportDownsampleAction$1.executeTask(TransportDownsampleAction.java:132)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.cluster.SimpleBatchedExecutor.execute(SimpleBatchedExecutor.java:70)

Maybe that's ILM triggering an unnecessary downsample action? Not sure. I think I'd like to see debug logs from the MasterService (i.e. set logger.org.elasticsearch.cluster.service.MasterService: DEBUG) to get a sense of what cluster state updates it's processing. Maybe there's substantially more of them when ILM is running than otherwise.

So, that'd be the logs during the recovery, right?

Is that an elasticsearch.yml setting, or somewhere else?

You can put it in elasticsearch.yml but normally you'd set it as a persistent cluster setting. See these docs for more info.
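For example, something along these lines (and set it back to null afterwards to return to the default level):

    PUT _cluster/settings
    {
      "persistent": {
        "logger.org.elasticsearch.cluster.service.MasterService": "DEBUG"
      }
    }

    # afterwards, reset to the default
    PUT _cluster/settings
    {
      "persistent": {
        "logger.org.elasticsearch.cluster.service.MasterService": null
      }
    }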