Shard Stuck in INITIALIZING and RELOCATING for more than 12 hours

I was talking about the expired X-Pack license. I had removed the following lines from the logs I shared earlier.

[<time stamp>][WARN ][o.e.l.LicenseService     ] [es-d01-rm] 
#
# LICENSE [EXPIRED] ON [THURSDAY, JANUARY 01, 1970]. IF YOU HAVE A NEW LICENSE, PLEASE UPDATE IT.
# OTHERWISE, PLEASE REACH OUT TO YOUR SUPPORT CONTACT.
# 
# COMMERCIAL PLUGINS OPERATING WITH REDUCED FUNCTIONALITY
# - security
#  - Cluster health, cluster stats and indices stats operations are blocked
#  - All data operations (read and write) continue to work
# - watcher
#  - PUT / GET watch APIs are disabled, DELETE watch API continues to work
#  - Watches execute and write to the history
#  - The actions of the watches don't execute
# - monitoring
#  - The agent will stop collecting cluster and indices metrics
#  - The agent will stop automatically cleaning indices older than [xpack.monitoring.history.duration]
# - graph
#  - Graph explore APIs are disabled
# - ml
#  - Machine learning APIs are disabled
# - logstash
#  - Logstash will continue to poll centrally-managed pipelines
# - deprecation
#  - Deprecation APIs are disabled
# - upgrade
#  - Upgrade API is disabled

I have enabled the traces you asked for.
I will investigate the shard-failed error messages later on.
I got the timeout while using Dev Tools in Kibana.
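
For reference, this is roughly how I enabled them: a transient cluster settings update that turns the transport tracer logger up to TRACE, with an include pattern (my own choice, just to cut the noise down to recovery actions):

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.transport.TransportService.tracer": "TRACE",
    "transport.tracer.include": "internal:index/shard/recovery/*"
  }
}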

Hmm, I wonder if the expired licence is pertinent. Is it just that node that reports an expired licence? Odd that it says it expired ON [THURSDAY, JANUARY 01, 1970]. I would dig into this further too.

I don't think the expired license has anything to do with this issue. We have disabled all X-Pack features, and we have been seeing this log for a very long time now on every node, and on other clusters as well.

After I enabled the tracing, the only change I see is that es-d01-rm now logs this one line every time a recovery is attempted.

[2018-12-18T01:24:12,453][TRACE][o.e.t.T.tracer ] [es-d01-rm] [1297722][internal:index/shard/recovery/start_recovery] received request

There are no new logs except the above on any node of interest.

Ok, at least the message is arriving. That's something.

The only way I can think of this happening right now is some interaction between the (supposedly-disabled) X-Pack security plugin and the expired licence. I cannot find a path through the recovery handler that doesn't result in more log messages, but components like security do have the ability to block messages before they reach the handler. We do not have exhaustive tests for Elasticsearch's behaviour in the presence of an invalid licence, so I think the best way forward is to fix the licence and bring this cluster closer to a supported configuration.

When I say disabled, I mean we have the following settings in the elasticsearch.yml config file on all nodes.

xpack.ml.enabled: false
xpack.security.enabled: false
xpack.monitoring.enabled: false
xpack.graph.enabled: false
xpack.watcher.enabled: false
node.ml: false
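
(One way to double-check that every node actually picked these up is the nodes info API; the filter_path here is optional and only trims the output down to the xpack settings.)

GET _nodes/settings?filter_path=nodes.*.settings.xpack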

These settings were added while we were on Elasticsearch 5.6.10. We then decided to do a rolling upgrade of all our clusters to 6.2.4. All the small clusters upgraded just fine, but this cluster got stuck. We have two more clusters that have got themselves into this same state where some shards are stuck initializing, and their logs are similar to this one. I hope this information helps in root-causing the issue. In the meantime, I will try to install the basic X-Pack license (with all features still disabled) on all nodes of this cluster and see if that helps.
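
For reference, my plan for installing it is roughly the following: register the basic license file downloaded from Elastic, where the request body is the contents of that file.

PUT _xpack/license?acknowledge=true
{
  "license": { ...contents of the downloaded basic license file... }
}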

I installed a basic X-Pack license. The expired-license message is no longer being logged, but this did not fix the initializing shards. I don't see any new trends in the logs either.

OK, we just spotted this in your node stats:

				"generic": {
					"threads": 128,
					"queue": 832220,
					"active": 128,
					"rejected": 0,
					"largest": 128,
					"completed": 1802552
				},

I think this is an echo of the earlier time when node_concurrent_recoveries was set so high - see https://github.com/elastic/elasticsearch/issues/36195 for instance. The generic threadpool on es-d01-rm is completely exhausted and will likely not recover in a reasonable timeframe.

It would be useful if you could grab the hot threads from the cluster so we can just check it's not something else:

GET /_nodes/hot_threads?threads=9999

Then I would restart es-d01-rm and check for any other nodes with unreasonably full queues in their threadpools.
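
Something like the cat thread pool API gives a quick view of those queues across all nodes, sorted so the fullest queues come first:

GET _cat/thread_pool/generic?v&h=node_name,name,active,queue,rejected,largest,completed&s=queue:desc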

I checked the latest status of the generic thread pool on es-d01-rm and the queue size is 1262467! GET /_nodes/hot_threads is always timing out. Is it trying to get executed on es-d01-rm and just getting queued? Is there any other way I can capture this information?

We cannot restart this node right now because it has some primary shards whose replica copies on other nodes are all unassigned or still initializing, so if we restart this node the cluster will go red. We are currently re-indexing documents out of the yellow indices into green indices; once that is done, we can restart this node. Is there any other way of draining this queue without restarting the node?

I cannot think of one.

After we completed re-indexing data out of the yellow indices, I started deleting them so that the cluster could get back to a green state. After I deleted one particular index, the generic thread pool queue on es-d01-rm drained on its own, without a restart. This suggests the shard really was stuck initializing for some reason, and draining the thread pool merely allowed other operations to execute on this node. This cluster has recovered now. We have two more clusters showing the same symptoms; they had multiple replica shards, so I was able to restart the data nodes with large generic thread pool queues. Thread pool queues are now under control on both clusters and recovery is in progress. I will share more logs for further investigation if these clusters also get stuck in the initializing state.
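
(For context, the re-indexing step above was just the standard reindex API, roughly as below; the index names are placeholders for a yellow index and its green replacement.)

POST _reindex
{
  "source": { "index": "yellow-index-placeholder" },
  "dest": { "index": "green-index-placeholder" }
}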

One of the two blocked clusters recovered automatically after restarting the data nodes with large queues; no index deletion was required for it. I am now collecting logs from the one remaining stuck cluster (it does not have large thread pool queues) for your analysis.

OK, I am now left with one cluster where a handful of shards are stuck initialising. Initially all indices had 2 replica shards. Whenever I found a particular shard stuck in recovery, I set the replica count of the corresponding index to 1 and the recovery process "got unblocked". I have kept doing this until now. The number of unassigned shards is under control, though still quite high.
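
(Setting the replica count to 1 is just an index settings update along these lines; the index name is a placeholder.)

PUT stuck-index-placeholder/_settings
{
  "index": { "number_of_replicas": 1 }
}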

The value of cluster.routing.allocation.node_concurrent_recoveries is 1. Shards get stuck in the translog stage during recovery even though ES reports there is nothing left to recover. Note that we are in the middle of upgrading Elasticsearch from 5.6.10 to 6.2.4, so not all nodes are upgraded yet, and we are blocked on this issue. Also note that some indices in this cluster were created by Elasticsearch 5.4.1; I am not sure if that is relevant but am mentioning it anyway. I was able to capture hot threads this time, so please have a look at that output, as it seems to contain useful information for diagnosing this problem. All of this is captured in the gists below.
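
(The recovery stage I am referring to comes from the recovery API; for one of the affected indices, something like the following shows each active recovery and the stage it is in.)

GET codesearchshared_11_0/_recovery?active_only=true&human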

Queries: https://gist.github.com/bittusarkar/497e71c5db9c14200a587ae924805802

Logs: I collected logs from the master node es-m01-rm, from source data node es-d25-rm and target data node es-d38-rm for the recovery of shard [codesearchshared_11_0][22]. Over time, different sets of shards attempted recovery and all of them have similar logs - https://gist.github.com/bittusarkar/52b29da60c7fa4dda1ef51d29f9d71cf

Update: since I originally posted this message, the generic thread pool queue on es-d25-rm has gone from 0 to ~377K. This implies that the thread pool queue buildup is a symptom and not the cause of the issue. From the hot threads output, it looks like there is some bug in sequence-ID-based recovery that leaves initializing shards "waiting forever" in the translog stage.

The logs you provided indicate infrastructural problems. Firstly, one of the nodes does not have enough free space:

[2018-12-26T17:16:22,522][INFO ][o.e.c.r.a.DiskThresholdMonitor] [es-m01-rm] low disk watermark [85%] exceeded on [1jqD3BqWSoykJ59u85pm-g][es-d32-rm][F:\data\nodes\0] free: 525.5gb[12.8%], replicas will not be assigned to this node
[2018-12-26T17:16:54,563][INFO ][o.e.c.r.a.DiskThresholdMonitor] [es-m01-rm] low disk watermark [85%] exceeded on [1jqD3BqWSoykJ59u85pm-g][es-d32-rm][F:\data\nodes\0] free: 525.5gb[12.8%], replicas will not be assigned to this node
[2018-12-26T17:17:26,412][INFO ][o.e.c.r.a.DiskThresholdMonitor] [es-m01-rm] low disk watermark [85%] exceeded on [1jqD3BqWSoykJ59u85pm-g][es-d32-rm][F:\data\nodes\0] free: 525.5gb[12.8%], replicas will not be assigned to this node
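
The per-node disk usage behind this watermark check is visible with the cat allocation API, for instance:

GET _cat/allocation?v&h=node,shards,disk.percent,disk.used,disk.avail,disk.total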

Secondly, there are frequent shard failures with the reason "mark copy as stale":

[2018-12-26T17:17:51,302][WARN ][o.e.c.a.s.ShardStateAction] [es-m01-rm] [wikisearchshared_2_1311][5] received shard failed for shard id [[wikisearchshared_2_1311][5]], allocation id [_3J19V0fRye18YQk536ilQ], primary term [29], message [mark copy as stale]
[2018-12-26T17:17:51,302][WARN ][o.e.c.a.s.ShardStateAction] [es-m01-rm] [wikisearchshared_2_1311][2] received shard failed for shard id [[wikisearchshared_2_1311][2]], allocation id [8XCdjc23RrSYLhtMUPHNWw], primary term [26], message [mark copy as stale]
[2018-12-26T17:17:51,302][WARN ][o.e.c.a.s.ShardStateAction] [es-m01-rm] [wikisearchshared_2_1311][7] received shard failed for shard id [[wikisearchshared_2_1311][7]], allocation id [S1mP0W4FRf-MfRHA6f5YtA], primary term [26], message [mark copy as stale]

I think these both need to be properly addressed before this cluster can stabilise.

Being stuck in the translog stage is not what happened with [codesearchshared_11_0][22], though. That recovery got as far as finalization, which indicates that it completed phase 2:

[2018-12-26T15:11:19,300][TRACE][o.e.i.r.RecoverySourceHandler] [es-d25-rm] [codesearchshared_11_0][22][recover to es-d38-rm] finalizing recovery

The hot threads dump indicates many threads on es-d25-rm waiting here, which occurs during finalization:

org.elasticsearch.index.seqno.GlobalCheckpointTracker.waitForLocalCheckpointToAdvance(GlobalCheckpointTracker.java:812)

It's possible this is a further consequence of the earlier deadlock/overload, perhaps exacerbated by a known issue in 6.2.4. If so, dropping the number of replicas down to 0 on any affected indices will help. Once the cluster is stable and you've resolved the "mark copy as stale" issue above, you can start adding replicas back again.

Thank you for your response @DavidTurner.

I don't think the DiskThresholdMonitor logs indicate a problem in this context. Those nodes have only hit the low disk watermark, which should not impact initializing shards. Also, the node on which the low disk watermark is breached is not a participant in the recovery of the shard in question. We can ignore these logs.

I don't understand what the "mark copy as stale" message actually means. Is there more logging I can enable to understand it better? I searched online for this error but did not find anything useful. Also, we see this error sporadically on all our clusters, small or big, which leads me to believe these are transient errors. But if that is not the case, I would like to understand this better. Can you help here?

I don't understand what you meant above about the recovery of [codesearchshared_11_0][22] reaching finalization. That shard has still not been assigned, so it must have failed at some stage, no?

Regarding the threads waiting in GlobalCheckpointTracker.waitForLocalCheckpointToAdvance, do we know what that wait is about and why a recovery would stay in that state for such a long time?

About the known issue you mentioned: the indices whose shards are stuck right now don't update their mappings during bulk indexing, so I guess that is not what is happening here.

We want dropping the replica count to 0 to be our last resort for fixing this cluster. We still have one more cluster to upgrade, and from what we have seen so far, the risk of this issue recurring is very high. I will, however, reduce cluster.routing.allocation.node_concurrent_recoveries back to the default of 2 before upgrading. Beyond that, I still don't understand why shards are stuck initializing.
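
(For reference, I plan to set it back with a persistent cluster settings update roughly like this.)

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}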

Sure. It means that Elasticsearch tried to replicate an operation on a shard (e.g. indexing a document) but one of the replica copies was unavailable at the time. Since that copy had previously been available and in-sync, it has to be removed from the in-sync set and marked as stale. This normally happens if a node drops off the cluster, the cluster hasn't yet reassigned all of its shards elsewhere, and you then write to one of the affected shards. Being a WARN-level log, it isn't expected to happen in healthy clusters.
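
If you want to see which copies are currently tracked as in-sync, the allocation IDs are recorded in the index metadata in the cluster state. Something like the following should show them for one of the indices from your logs (the filter_path is only there to trim the output):

GET _cluster/state/metadata/wikisearchshared_2_1311?filter_path=metadata.indices.*.in_sync_allocations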

I can't see any more useful logging to add for this beyond what we normally log about nodes joining/leaving the cluster and the cluster health being green.

Correct, it failed in the finalisation stage, not in the translog stage as you had said.

It's waiting for the recovering shard's local checkpoint to exceed the global checkpoint, indicating that it's processed all operations in the translog and can now be marked as in-sync.
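
You can see these checkpoints for every copy of a shard in the shard-level index stats, which include a seq_no section with max_seq_no, local_checkpoint and global_checkpoint. Something like this should show them (again, the filter_path is only there to trim the output):

GET codesearchshared_11_0/_stats?level=shards&filter_path=indices.*.shards.*.seq_no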

Looking again at the hot threads output it seems that the generic threadpool is completely full of threads that are stuck here, presumably blocking something that would actually cause any of these checkpoints to update. I've not investigated what. As far as I can tell there's one of these actions per recovery, so this also looks like a consequence of the earlier excess of recoveries.

It does look like removing the replicas from all affected indices will unblock these actions, as will restarting the node, but I can't say what other long-term effects there might be lurking in this cluster. It's got into a very bad state. I recommend a complete restart.

I fixed the cluster by reducing the replica count wherever required and then increasing it back to 2 to get back to the previous good state. Thank you for all the help @DavidTurner.
