Shrink unable to allocate index due to initial_recovery setting

We ran into a very strange issue with index shrink following recovery from a complete cluster network outage. The outage appears to have taken place in the middle of a shrink operation, and following the restart the shrink index was created but unallocatable due to an incompatibility with our cluster allocation settings.

We run a large cluster with hot/warm nodes and do index rollover on ingest indices, so when an index has been filled it's rolled over, moved to a warm node and can then be shrunk (we don't currently use ILM for this as it's part of our wider index management/lifecycle tooling).

Prior to the shrink, the index settings were:

$ curl 'http://127.0.0.1:9200/v46.agentfileevent@1m.2-000155/_settings?pretty&filter_path=*.settings.index.routing&flat_settings'
{
  "v46.agentfileevent@1m.2-000155" : {
    "settings" : {
      "index.routing.allocation.require._name" : "elasticsearch-47",
      "index.routing.allocation.require.warm" : "true"
    }
  }
}

and shards had been correctly moved to elasticsearch-47 (the warm node):

$ curl http://127.0.0.1:9200/_cat/shards/v46.agentfileevent@1m.2-000155
v46.agentfileevent@1m.2-000155 1 p STARTED 18623317 5.2gb 10.0.111.134 elasticsearch-hot-1
v46.agentfileevent@1m.2-000155 1 r STARTED 18623317 5.2gb 10.0.5.131   elasticsearch-47
v46.agentfileevent@1m.2-000155 2 p STARTED 18617985 5.1gb 10.0.111.134 elasticsearch-hot-1
v46.agentfileevent@1m.2-000155 2 r STARTED 18617985 5.1gb 10.0.5.131   elasticsearch-47
v46.agentfileevent@1m.2-000155 0 p STARTED 18624048 5.1gb 10.0.111.134 elasticsearch-hot-1
v46.agentfileevent@1m.2-000155 0 r STARTED 18624048 5.1gb 10.0.5.131   elasticsearch-47

It's already slightly unusual here that shard copies stay allocated on elasticsearch-hot-1 even though it doesn't match the index.routing.allocation.require rules. This behaviour seems to be consistent, though: until relocation of those shards is triggered (e.g. by restarting the node) they stay allocated and the cluster stays green.

A shrink was then performed, but at some point during the shrink the network outage happened. When the cluster started back up, the newly created shrink index could not be allocated, as it had a routing.allocation.initial_recovery setting that required it to be allocated on elasticsearch-hot-1:

$ curl 'http://127.0.0.1:9200/v46.agentfileevent@1m.2-shrink-000155/_settings?pretty&filter_path=*.settings.index.routing&flat_settings'
{
  "v46.agentfileevent@1m.2-shrink-000155" : {
    "settings" : {
      "index.routing.allocation.initial_recovery._id" : "3QDvDT8PT3qoMJZh_JEVCA", <-- node ID of elasticsearch-hot-1
      "index.routing.allocation.require._name" : "elasticsearch-47",
      "index.routing.allocation.require.warm" : "true"
    }
  }
}

Unfortunately I don't have a copy of the cluster allocation explain output, but it was complaining that the shard couldn't be allocated, as no node could match all of the routing constraints.

In the end I was able to delete the shrink index and re-run the shrink on the un-shrunk version, which worked correctly. However, I'm trying to understand how it might have got into this state. I've managed to reproduce the same shard allocation on a test setup, but shrinking/restarting/relocating nodes all seems to work as expected there.

Looking at the source for index shrink (MetadataCreateIndexService.java at 9c45dbcb8e20e4fbde6a4aac72f81e7092742395 in elastic/elasticsearch on GitHub), I'd expect that in this case, with all shards available on two nodes, the initial_recovery._id would be set to both node IDs as a comma-separated list, which is exactly what I see when running locally. I have no idea how it got into a state where the initial recovery ID was set only to the incompatible hot node, especially as the index was never allocated there.
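For reference, my understanding of that code is that a node is only an eligible initial recovery target if it holds a copy (primary or replica) of every shard of the source index. A rough illustration of that selection (a Python sketch, not the actual Java; the shard table mirrors the `_cat/shards` output above):

```python
# Sketch: which nodes are eligible initial_recovery targets for a shrink?
# A node qualifies if it holds a copy of every shard of the source index.
# Shard-to-node mapping taken from the _cat/shards output above.
shard_copies = {
    0: {"elasticsearch-hot-1", "elasticsearch-47"},
    1: {"elasticsearch-hot-1", "elasticsearch-47"},
    2: {"elasticsearch-hot-1", "elasticsearch-47"},
}

# Intersect the per-shard node sets: only nodes holding every shard survive.
eligible = set.intersection(*shard_copies.values())
print(sorted(eligible))  # ['elasticsearch-47', 'elasticsearch-hot-1']
```

With the allocation shown above, both nodes hold a full copy, so I'd expect both IDs in the comma-separated list, not just the hot node's.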

One thing that did look quite suspicious to me was that the settings builder in prepareResizeIndexSettings adds the initial_recovery setting and then copies the settings from the source index. Is it possible that, if the source index already has initial_recovery set, it will overwrite the setting in the shrunk index? I'm not entirely clear on when initial_recovery is actually configured, but this might explain how an index could get into this state.
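To make the suspicion concrete: if the builder applies the computed initial_recovery first and then copies the source index's settings with last-write-wins semantics, a stale value on the source index would silently clobber the fresh one. A toy simulation (plain Python dicts standing in for the Java settings builder; the node IDs and the stale-setting scenario are hypothetical):

```python
# Toy last-write-wins builder, mimicking the suspected ordering in
# prepareResizeIndexSettings (a simplification, not the real code).
builder = {}

# Step 1: the shrink logic computes initial_recovery from the current allocation.
builder["index.routing.allocation.initial_recovery._id"] = "hot-1-id,warm-47-id"

# Step 2: settings are copied over from the source index. If the source index
# somehow still carried its own initial_recovery (e.g. left over from an
# interrupted earlier operation), it would overwrite the value from step 1.
source_index_settings = {
    "index.routing.allocation.initial_recovery._id": "hot-1-id",
    "index.routing.allocation.require.warm": "true",
}
builder.update(source_index_settings)

print(builder["index.routing.allocation.initial_recovery._id"])  # hot-1-id
```

Under that ordering the final index would carry only the stale node ID, which matches the broken state we observed.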

Not sure if anyone here has experienced this before, or has any other ideas about what might have triggered it? It ended up being quite a severe outage, as it required manual recovery of the index, and the cluster was red while in this state.

What version are you using? ISTR that ILM became more intelligent in this area at some point.

This was on 7.10.0-oss. We also don't use ILM (X-Pack licensing is an issue for us, and we have some requirements, like data migration, that can't be expressed in ILM but need to be done in conjunction with index management).

I'm not sure that ILM would particularly solve this problem either; looking through its source, it's just calling the same APIs. I think the underlying issue is that, when doing a shrink, settings copied from the parent index overwrite settings that are added as part of the shrink operation.

Indeed, but I think it will also abort the shrink if the chosen node leaves the cluster before the shrink completes, and retry on a different node. You'll need to implement similar logic yourself if you don't want to use ILM.

Ok, will look into this, thanks :+1:
