Shard Stuck in INITIALIZING and RELOCATING for more than 12 hours

We have a cluster with a large number of shards across multiple indices. We recently upgraded from ES 5.6 to 6.2.4 using a rolling upgrade, as per the instructions provided. Since the upgrade, a few shards have been stuck in INITIALIZING and RELOCATING for more than 2 days.

Since we had 2 replicas at the time of the upgrade, we thought load might be the issue, so we reduced the replica count to 1 for all indices. Some shards are still stuck even after that; only shards belonging to indices that were deleted were removed from the queue.

We also tried setting the replica count to 0 and back to 1 for the indices in yellow state, but that did not help either.
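
For reference, the replica changes were applied per index with something like this (the index name here is just a placeholder):

PUT /some_index/_settings
{ "index": { "number_of_replicas": 1 } }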

The cluster has been in this state for more than 12 hours with a replica count of 1:

{
  "cluster_name": "*****",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 11,
  "number_of_data_nodes": 5,
  "active_primary_shards": 10420,
  "active_shards": 20412,
  "relocating_shards": 54,
  "initializing_shards": 66,
  "unassigned_shards": 362,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 97.94625719769674
}

Cluster settings

{
  "persistent": {},
  "transient": {
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "node_concurrent_recoveries": "60",
          "enable": "all",
          "exclude": {
            "_ip": "192.168.0.155"
          }
        }
      }
    }
  }
}

Please help in resolving this, or point me to places to look for the issue. The logs show these two sets of messages every half an hour or so, for a lot of shards.

[es-m02-rm] [code_e0752061-1827-4652-912a-18e2b0f9282a][69] received shard failed for shard id [[code*_e0752061-1827-4652-912a-18e2b0f9282a][69]], allocation id [FbatnSmjTMejpeppJJSESQ], primary term [0], message [master {es-m02-rm}{HYF8DEbFSUurYCD7809SPQ}{bxd19cLvTHSWjs8L-_s_jw}{192.168.0.102}{192.168.0.102:9300}{faultDomain=0, updateDomain=2} has not removed previously failed shard. resending shard failure]

[es-d04-rm] [[code_e0752061-1827-4652-912a-18e2b0f9282a][69]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [code_e0752061-1827-4652-912a-18e2b0f9282a][69]: Recovery failed from {es-d01-rm}{05PeyBBySq-qL0NVHwdVmw}{fGTsu8lHRNyuDlUglbjAlg}{192.168.0.151}{192.168.0.151:9300}{faultDomain=0, updateDomain=0} into {es-d04-rm}{xSBwpjSuSNm-lQrjyb-H1g}{XyW5oQhaR_6UJNn70svUXw}{192.168.0.154}{192.168.0.154:9300}{faultDomain=1, updateDomain=3} (no activity after [30m])
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:286) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]
Caused by: org.elasticsearch.ElasticsearchTimeoutException: no activity after [30m]
... 6 more

You have far too many shards in your cluster given that you only have 5 data nodes. Please have a look at this blog post for some practical guidance on shards and sharding practices. I recommend you rethink your sharding strategy and look to reduce this significantly.

That's pretty excessive; you need to do something about that super urgently.

I would start by setting replicas to 0, reindexing into indices with fewer shards, and then adding the replicas back.
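
A rough sketch of that approach, with placeholder index names and shard counts:

PUT /old_index/_settings
{ "index": { "number_of_replicas": 0 } }

PUT /new_index
{ "settings": { "number_of_shards": 1, "number_of_replicas": 0 } }

POST /_reindex
{ "source": { "index": "old_index" }, "dest": { "index": "new_index" } }

PUT /new_index/_settings
{ "index": { "number_of_replicas": 1 } }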

Thanks for the reply. While it is a large number of shards, it is not possible to reduce them immediately, as this is a complex system with a very complex routing scheme used to make sure each query ends up on a single shard only.

While we rethink our sharding strategy, is there a way to recover the cluster back to normal with 2 replicas? This is critical data for our service.

Also, I checked and I don't see any disk space or heap space crunch. I am not sure how to debug it further, as the logs do not say much apart from timeouts. Will adding a few more nodes help? Disk space used is only 30%, with 70% still free.

Adding more nodes is really only a short-term solution.

You should aim for <600-700 shards per node. Certainly not 4000.

Having that many shards can lead to a quite large cluster state that may be slow to update and distribute across the cluster. Given that the cluster state needs to be updated and distributed after every change and you are likely to have a lot of changes with that many shards, I am not surprised it is slow.
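
A quick way to see how many shards each node is currently holding:

GET _cat/allocation?v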

"node_concurrent_recoveries": "60"

This will be contributing to your pain. Set it back to the default and let the recoveries work through more sequentially. There are so many happening in parallel, slowing them all down, that each one is hitting indices.recovery.recovery_activity_timeout (30 minutes) and failing. If you do fewer at once then there's more chance of them succeeding.
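
For example, since it was set as a transient setting, setting it to null there clears it back to the default:

PUT /_cluster/settings
{ "transient": { "cluster.routing.allocation.node_concurrent_recoveries": null } }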

I also echo the others' advice to reduce your shard count or increase your node count.


Thanks, setting it back to the default values. I have also deleted some unused indices we created for future growth, which has brought the number of shards down to 7000 from 20000. I am waiting for the cluster to recover now...

will get back if I don't see progress...

There's zero value in creating indices before they need to have data in them.

We have increased the data node count from 5 to 10 and reduced the replica count to 1. We did that 24 hours back. Yet, shards are hardly relocating to the new data nodes. For almost all the unassigned/initializing shards, the corresponding primary shards are present on the es-d01-rm data node. We have even tried excluding this node from shard allocation to force relocation to other nodes, but that is also not working. Setting the replica count to 0 and then back to 1 did not help either. There is definitely something wrong with this data node, but we cannot figure out what. Here are some commands we executed a few minutes ago to capture the current state. Let me know if you need more data to diagnose this issue.
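
For reference, the exclusion was applied with something like this (the _ip form we had set earlier works the same way):

PUT /_cluster/settings
{ "transient": { "cluster.routing.allocation.exclude._name": "es-d01-rm" } }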

We keep getting the "no activity after [30m]" errors in the log file. See below:

[es-d08-rm] [[wiki_wikicontract_1421_shared_e9529131-b634-4873-b279-c9dd8a0bd2e5][7]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [wiki_wikicontract_1421_shared_e9529131-b634-4873-b279-c9dd8a0bd2e5][7]: Recovery failed from {es-d01-rm}{05PeyBBySq-qL0NVHwdVmw}{fGTsu8lHRNyuDlUglbjAlg}{192.168.0.151}{192.168.0.151:9300}{faultDomain=0, updateDomain=0} into {es-d08-rm}{EeSbzForSNeuewm02TeSlA}{7YAg4GuxRBORgeZ2ND0nDg}{192.168.0.158}{192.168.0.158:9300}{faultDomain=1, updateDomain=2} (no activity after [30m])
	at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:286) [elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.4.jar:6.2.4]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
	at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]
Caused by: org.elasticsearch.ElasticsearchTimeoutException: no activity after [30m]
	... 6 more

The link you posted earlier suggests that you are still trying to do more concurrent recoveries than the default of 2:

          "explanation": "reached the limit of outgoing shard recoveries [10] on the node [EeSbzForSNeuewm02TeSlA] which holds the primary, cluster setting [cluster.routing.allocation.node_concurrent_outgoing_recoveries=10] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"

I still think you should set this to the default (or maybe even 1) to allow your cluster to stabilise in a more controlled fashion. Please do this and then monitor the progress of the remaining recoveries using GET /_recovery (or GET _cat/recovery).
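
For example, to see only the recoveries that are still in flight:

GET _recovery?active_only=true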

If this doesn't help then I would be interested to see the full output of GET /_recovery, although it will be quite large. I think we will also need to see more of the logs than just the lines saying "recovery failed" to have a hope of investigating why these recoveries continue to time out.

I did set it to 1 yesterday but did not notice any change. Even shards with 0 documents were stuck in initializing state for hours. I have set it back to 1 now. I will check back after 12 hours and report with the output of GET _recovery. Is there any other information you think would be useful to capture?

I just checked the status and nothing has changed. There is one replica shard with 0 documents still initializing. This is the output of GET _recovery shortly after I set concurrent allocations to 1 and this one I captured a few minutes ago. Replica shard with Id 1 of index wiki_wikicontract_1421_shared_e9529131-b634-4873-b279-c9dd8a0bd2e5 is initializing on node es-d08-rm. The corresponding primary shard is allocated on es-d01-rm. Let me know if you need more information.
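
If it helps, I can also grab the allocation explanation for that stuck replica with:

POST /_cluster/allocation/explain
{ "index": "wiki_wikicontract_1421_shared_e9529131-b634-4873-b279-c9dd8a0bd2e5", "shard": 1, "primary": false }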

The recovery in question reports like this:

      {
        "id": 0,
        "type": "PEER",
        "stage": "INDEX",
        "primary": false,
        "start_time_in_millis": 1544782964186,
        "total_time_in_millis": 462847,
        "source": {
          "id": "05PeyBBySq-qL0NVHwdVmw",
          "host": "192.168.0.151",
          "transport_address": "192.168.0.151:9300",
          "ip": "192.168.0.151",
          "name": "es-d01-rm"
        },
        "target": {
          "id": "EeSbzForSNeuewm02TeSlA",
          "host": "192.168.0.158",
          "transport_address": "192.168.0.158:9300",
          "ip": "192.168.0.158",
          "name": "es-d08-rm"
        },
        "index": {
          "size": {
            "total_in_bytes": 0,
            "reused_in_bytes": 0,
            "recovered_in_bytes": 0,
            "percent": "0.0%"
          },
          "files": {
            "total": 0,
            "reused": 0,
            "recovered": 0,
            "percent": "0.0%"
          },
          "total_time_in_millis": 462808,
          "source_throttle_time_in_millis": 0,
          "target_throttle_time_in_millis": 0
        },
        "translog": {
          "recovered": 0,
          "total": -1,
          "percent": "-1.0%",
          "total_on_start": -1,
          "total_time_in_millis": 0
        },
        "verify_index": {
          "check_index_time_in_millis": 0,
          "total_time_in_millis": 0
        }
      },

The one from a few hours later looks similar. What's strange is that it's saying there's nothing to recover: no files, no bytes. I think this is a state that the recovery goes through while it's calculating the list of files to send, but it should only be in this state for a short while.

Since there's only one recovery in progress now I think it'd be ok to increase the log level and look at the fine detail of what's going on here:

PUT /_cluster/settings
{"persistent":{"logger.org.elasticsearch.indices.recovery":"TRACE"}}

Leave that set for a whole recovery (i.e. until you've seen two [failed recovery] messages) and then reset it with:

PUT /_cluster/settings
{"persistent":{"logger.org.elasticsearch.indices.recovery":null}}

Then please share the resulting logs.

I should add that the pertinent logs will be spread across the elected master node, the source node of the recovery and the target node of the recovery. Please share the logs from all three.

GET _cat/shards?v says the replica shard [wiki_wikicontract_1421_shared_e9529131-b634-4873-b279-c9dd8a0bd2e5][1] is initializing on node es-d08-rm, but I also see log entries for the same shard in the logs of node es-d10-rm. Not sure if this information is useful though. es-m02-rm is the master and the corresponding primary shard is on node es-d01-rm. I did not find any useful information related to shard recovery in the logs of es-d01-rm. Please find the logs you requested at https://gist.github.com/bittusarkar/8b4adb4573642508324115aec86ff211. I turned on tracing at around 2018-12-15 12:50:00. Given that it did not produce many logs, I have not turned it off yet in case it is required for further investigation.

I see. This is deeply puzzling. Each recovery starts with a log like this on the target node:

[2018-12-15T13:23:13,409][TRACE][o.e.i.r.PeerRecoveryTargetService] [es-d08-rm] [wiki_wikicontract_1421_shared_e9529131-b634-4873-b279-c9dd8a0bd2e5][1] started recovery from {es-d01-rm}{05PeyBBySq-qL0NVHwdVmw}{fGTsu8lHRNyuDlUglbjAlg}{192.168.0.151}{192.168.0.151:9300}{faultDomain=0, updateDomain=0}, id [392]

The target node then sends a message to the source node, which should log a corresponding "starting recovery to" message or throw an exception; either way we should see something in the logs. Even if the message didn't get through, I would expect that to be reported in the logs. Are you suppressing any log messages that would normally be recorded by default?

Can you check connectivity between these nodes? Can you try running GET _nodes/es-d01-rm/stats on es-d08-rm and vice versa?

The master logs contain some messages like these that also make me think you might have connectivity issues:

[2018-12-15T09:47:44,673][WARN ][o.e.c.a.s.ShardStateAction] [es-m02-rm] [codesearchshared_1_0][44] received shard failed for shard id [[codesearchshared_1_0][44]], allocation id [T-Cfsp_2SkeGgfjtZhfNKg], primary term [18], message [mark copy as stale]
[2018-12-15T10:02:13,011][WARN ][o.e.c.a.s.ShardStateAction] [es-m02-rm] [codesearchshared_3_0][45] received shard failed for shard id [[codesearchshared_3_0][45]], allocation id [Mmfo3AJsT-y3WEVdRRddQA], primary term [16], message [mark copy as stale]

Why are these shards failing? Normally there would be other messages indicating this, but in your log they are missing.

No, I have not suppressed any relevant logs. I just deleted the expired license logs before sharing them with you. Also, I don't see any obvious connectivity issues between es-d01-rm and the other nodes in the cluster. Find the output of the nodes stats commands at https://gist.github.com/bittusarkar/27eeecc2ed08687053a8516ef08df4a7. Another thing I am noticing now is that _refresh commands for indices which have primary shards allocated on es-d01-rm are timing out. For example, POST code_sourcenodedupefilecontractv3_1421_shared_edbe3bf1-20a2-4135-a810-477ce2c52f00/_refresh returns

{
  "statusCode": 504,
  "error": "Gateway Time-out",
  "message": "Client request timeout"
}

What license has expired?

Can we try some lower-level logging?

PUT _cluster/settings
{
   "transient" : {
      "logger.org.elasticsearch.transport.TransportService.tracer" : "TRACE",
      "transport.tracer.include" : "internal:index/shard/recovery/*"
   }
}

Reset this with

PUT _cluster/settings
{
   "transient" : {
      "logger.org.elasticsearch.transport.TransportService.tracer" : null,
      "transport.tracer.include" : null
   }
}

Also please could you look further into those shard-failed messages, perhaps checking the logs on the node holding the corresponding primary?
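
You can find which nodes hold the primaries for those shards with, for instance:

GET _cat/shards/codesearchshared_1_0,codesearchshared_3_0?v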

That is also suspicious and suggestive of connectivity issues, although the timeout you are seeing is not an Elasticsearch timeout: Elasticsearch will wait patiently for success or failure there and doesn't return a 504.