We have a cluster with a large number of shards across multiple indices. We recently upgraded from ES 5.6 to 6.2.4 via a rolling upgrade, following the documented instructions. Since the upgrade, a few shards have been stuck in INITIALIZING and RELOCATING state for more than 2 days.
Since we had 2 replicas at the time of the upgrade, we thought load might be the issue, so we reduced the replica count to 1 for all indices. Some shards are still stuck even after that; the only shards removed from the queue were those that were deleted.
We also tried setting the replica count to 0 and back to 1 for the indices in yellow state, but that did not help either.
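For reference, the replica changes were made through the index settings API, roughly like this (the host and index pattern are placeholders, not our real values):

# drop replicas for the yellow indices
curl -X PUT "localhost:9200/my_index_*/_settings" -H 'Content-Type: application/json' -d'
{
  "index": { "number_of_replicas": 0 }
}'

# then restore one replica
curl -X PUT "localhost:9200/my_index_*/_settings" -H 'Content-Type: application/json' -d'
{
  "index": { "number_of_replicas": 1 }
}'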
The cluster has been in the following state for more than 12 hours with a replica count of 1:
{
  "cluster_name": "*****",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 11,
  "number_of_data_nodes": 5,
  "active_primary_shards": 10420,
  "active_shards": 20412,
  "relocating_shards": 54,
  "initializing_shards": 66,
  "unassigned_shards": 362,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 97.94625719769674
}
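(The output above is from the cluster health API, i.e. something like the following, with the host as a placeholder:)

curl -s "localhost:9200/_cluster/health?pretty"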
Cluster settings
{
  "persistent": {},
  "transient": {
    "cluster": {
      "routing": {
        "rebalance": {
          "enable": "all"
        },
        "allocation": {
          "node_concurrent_recoveries": "60",
          "enable": "all",
          "exclude": {
            "_ip": "192.168.0.155"
          }
        }
      }
    }
  }
}
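The transient settings above were applied through the cluster settings API along these lines (host is a placeholder):

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.rebalance.enable": "all",
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.allocation.node_concurrent_recoveries": 60,
    "cluster.routing.allocation.exclude._ip": "192.168.0.155"
  }
}'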
Please help me resolve this, or point me to where I should look for the cause. The logs show the following two sets of messages roughly every half an hour for a lot of shards.
[es-m02-rm] [code_e0752061-1827-4652-912a-18e2b0f9282a][69] received shard failed for shard id [[code*_e0752061-1827-4652-912a-18e2b0f9282a][69]], allocation id [FbatnSmjTMejpeppJJSESQ], primary term [0], message [master {es-m02-rm}{HYF8DEbFSUurYCD7809SPQ}{bxd19cLvTHSWjs8L-_s_jw}{192.168.0.102}{192.168.0.102:9300}{faultDomain=0, updateDomain=2} has not removed previously failed shard. resending shard failure]
[es-d04-rm] [[code_e0752061-1827-4652-912a-18e2b0f9282a][69]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.indices.recovery.RecoveryFailedException: [code_e0752061-1827-4652-912a-18e2b0f9282a][69]: Recovery failed from {es-d01-rm}{05PeyBBySq-qL0NVHwdVmw}{fGTsu8lHRNyuDlUglbjAlg}{192.168.0.151}{192.168.0.151:9300}{faultDomain=0, updateDomain=0} into {es-d04-rm}{xSBwpjSuSNm-lQrjyb-H1g}{XyW5oQhaR_6UJNn70svUXw}{192.168.0.154}{192.168.0.154:9300}{faultDomain=1, updateDomain=3} (no activity after [30m])
at org.elasticsearch.indices.recovery.RecoveriesCollection$RecoveryMonitor.doRun(RecoveriesCollection.java:286) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_72]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_72]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_72]
Caused by: org.elasticsearch.ElasticsearchTimeoutException: no activity after [30m]
... 6 more
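For context, the [30m] in the trace appears to be the recovery activity timeout (indices.recovery.recovery_activity_timeout, which defaults to 30m); we have not overridden it in the cluster settings shown above. Would raising it be a reasonable workaround while we dig into the root cause? If so, I assume it would be something like this (host is a placeholder):

curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": { "indices.recovery.recovery_activity_timeout": "60m" }
}'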