ILM slows down node recovery/rolling upgrade process

On my test cluster, I started a rolling upgrade. It's a 3-node cluster, so the cluster health went promptly to red. Once my upgraded node started up, I found that the index reallocation/recovery process was taking far longer than it should, as in minutes per shard.

I also noticed a lot of ILM related errors in the logs.

Putting two and two together, I stopped ILM and the recovery process started speeding up after a bit. I'm guessing it had to wait for the ILM jobs that were active to fully stop.
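(To be clear, "stopped ILM" here just means the ILM stop/start APIs, roughly:)

    # stop ILM and check that it actually reaches STOPPED
    POST _ilm/stop
    GET _ilm/status

    # start it again once the upgrade is finished
    POST _ilm/start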

Which leaves me wondering, why did ILM even try to do anything when the cluster was red? Even if that wasn't a sign to stop, I'd think finding that the shards needed for the job are unallocated or not recovered would trigger a quick short circuit. (I'm assuming the logs I'm seeing are related to the unallocated shards; I haven't actually confirmed that.) Why would it keep trying to do enough work that allocation would slow down so drastically?

For older versions of ES there were instructions about disabling shard reallocation to speed up recovery during a rolling upgrade, but those instructions seem to be gone. As far as I can tell, there are no special rolling upgrade instructions anymore.

So I don't think I'm missing anything, but am I?

Or is this just something that happens, and the solution is to remember to stop ILM before upgrading?

It sounds like you expected this, but that's not right: cluster health should be yellow or green during a rolling restart/upgrade, never red. You must have had some unassigned shards, or shards without replicas, to cause that.

Red health just means that at least one primary shard is unassigned. That's not a reason for ILM to stop, and indeed we want ILM to keep on trying even on the unhealthy index so that it can proceed ASAP when the index returns to health.

Manually stopping ILM shouldn't be necessary during a rolling upgrade. Could you share the output of GET _cluster/stats?pretty please?

There's still a recommended process to follow for a rolling upgrade (effectively the same as a rolling restart). But disabling ILM is not part of this process.
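The allocation-related part of that process looks roughly like this (check the docs for your version, since the recommended setting has changed over the years):

    # before taking a node down, restrict allocation to primaries only
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": "primaries"
      }
    }

    # once the upgraded node has rejoined, re-enable allocation
    PUT _cluster/settings
    {
      "persistent": {
        "cluster.routing.allocation.enable": null
      }
    }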

Thanks to a lack of resources, I have ES configured to have only a single primary shard for each index. So that would explain the red status. (I originally tried to have only 1 node as well, but restarts on that got slow really fast due to needing to open more indexes than normal.)

Though I did think taking 1 node out of a 3-node cluster would make it go red as well.

By the time I'm back in the office, it'll have been a while. Is it possible to load the stats for a specific time period?

Ok, sheesh. I looked for that doc repeatedly over the past few versions and somehow missed it. So I should still stop reallocation between nodes.

By "single primary shard" do you mean number_of_replicas: 0? In which case are you sure a rolling upgrade makes sense? It's quite a complex sequence of steps designed to maintain availability throughout, but that requires replicas - if you are ok with a period of unavailability then it'd be a good deal simpler (and possibly quicker) to upgrade all the nodes at the same time.

No, but the current stats will still be useful to see.


Any place I can send the stats that isn't public? I'm feeling overly cautious today.

You can send me a DM if you'd prefer.

Er, you have DMs turned off.

Not sure what's wrong with my DMs but we connected eventually. I don't see anything particularly concerning unfortunately. A decent shard count but not ludicrously high, and a pretty recent version:

  "indices": {
    "count": 7626,
    "shards": {
      "total": 7700,
      "primaries": 7626,
...
    "versions": [
      "8.12.0"
    ],

I think if ILM really is slowing things down so much when the cluster has many unassigned primaries then we would consider that a bug. It's not a scenario that we cover in any benchmarks today AFAIK, so it wouldn't surprise me if there were a performance bug of this nature. Pretty tricky to know how to fix it though. Maybe collecting GET _nodes/_master/hot_threads?threads=9999 repeatedly while the cluster is recovering would help shed some light on the problem.

I looked at some hot threads captured during recovery (also shared via DM) but I didn't see anything obviously related to ILM in there. It caught one snapshot that's perhaps a little suspicious:

       java.base@21.0.2/java.lang.Throwable.<init>(Throwable.java:273)
       java.base@21.0.2/java.lang.Exception.<init>(Exception.java:67)
       java.base@21.0.2/java.lang.RuntimeException.<init>(RuntimeException.java:63)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.ElasticsearchException.<init>(ElasticsearchException.java:134)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.ResourceAlreadyExistsException.<init>(ResourceAlreadyExistsException.java:26)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.ResourceAlreadyExistsException.<init>(ResourceAlreadyExistsException.java:21)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.cluster.metadata.MetadataCreateIndexService.validateIndexName(MetadataCreateIndexService.java:177)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.cluster.metadata.MetadataCreateIndexService.validate(MetadataCreateIndexService.java:1370)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.cluster.metadata.MetadataCreateIndexService.applyCreateIndexRequest(MetadataCreateIndexService.java:345)
       org.elasticsearch.xpack.downsample.TransportDownsampleAction$3.execute(TransportDownsampleAction.java:800)
       org.elasticsearch.xpack.downsample.TransportDownsampleAction$1.executeTask(TransportDownsampleAction.java:136)
       org.elasticsearch.xpack.downsample.TransportDownsampleAction$1.executeTask(TransportDownsampleAction.java:132)
       app/org.elasticsearch.server@8.13.2/org.elasticsearch.cluster.SimpleBatchedExecutor.execute(SimpleBatchedExecutor.java:70)

Maybe that's ILM triggering an unnecessary downsample action? Not sure. I think I'd like to see debug logs from the MasterService (i.e. set logger.org.elasticsearch.cluster.service.MasterService: DEBUG) to get a sense of what cluster state updates it's processing. Maybe there's substantially more of them when ILM is running than otherwise.

So, that'd be the logs during the recovery, right?

Is that an elasticsearch.yml setting, or somewhere else?

You can put it in elasticsearch.yml but normally you'd set it as a persistent cluster setting. See these docs for more info.
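For example, something along these lines (and set it back to null afterwards to return to the default level):

    PUT _cluster/settings
    {
      "persistent": {
        "logger.org.elasticsearch.cluster.service.MasterService": "DEBUG"
      }
    }

    # afterwards, reset to the default
    PUT _cluster/settings
    {
      "persistent": {
        "logger.org.elasticsearch.cluster.service.MasterService": null
      }
    }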