Shards randomly going into "initializing"/"unassigned" state even when the cluster is in a stable state

Description : Even without any node disturbances (nodes leaving/joining), some shards randomly go into INITIALIZING or UNASSIGNED state over a 24-48 hour window.

Environment :
ES version : 7.0.1
Number of nodes : 12 (no dedicated roles - all nodes assume all roles)
Are indices static ? : 50% static indices, 50% rollover indices, with rollover periods starting from a 2-hour minimum
Deployment : All 12 ES nodes run as Kubernetes pods
Metrics collection : We use Telegraf to collect metrics and store them in InfluxDB for ES-specific metrics tracking

Index-Specific Configurations :

  1. 10 shards per index with 1 replica each

Observations so far :

  • Heap, disk, and RAM usage on the instances does not exceed 60% (from node stats)
  • No abnormal master elections seen
  • Kubernetes liveness probes do not fail (so we can rule out a node failing to respond to Kubernetes and being restarted by it)
  • No nodal restarts
  • No abnormal load on the master node
  • Checked whether rollover of indices (destroying old indices and creating new ones) causes shard state changes. Not all rollover indices go into YELLOW state - fewer than 5% of them do. Some remain YELLOW after exhausting all allocation attempts, but even this pattern is inconsistent: in a few instances they recover cleanly and return to GREEN state over time
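To spot these shards as they appear, the _cat/shards API can be filtered down to anything not STARTED. The snippet below is a minimal sketch that filters a hand-written sample of what _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason might return - the sample rows and the index name are made up for illustration, not real cluster output:

```python
import json

# Made-up sample of _cat/shards JSON output; in practice you would fetch
# GET /_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason
sample = json.loads("""
[
  {"index": "logs-000042", "shard": "3", "prirep": "r", "state": "UNASSIGNED",
   "unassigned.reason": "ALLOCATION_FAILED"},
  {"index": "logs-000042", "shard": "3", "prirep": "p", "state": "STARTED",
   "unassigned.reason": null}
]
""")

# Keep only shards that are not fully started (INITIALIZING, UNASSIGNED, ...)
problem = [s for s in sample if s["state"] != "STARTED"]
for s in problem:
    print(s["index"], s["shard"], s["prirep"], s["state"], s["unassigned.reason"])
```

Run periodically (e.g. from a cron job feeding Telegraf), this gives a timeline of exactly which shards flap and with what unassigned reason.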

Clarifications :

  1. While I understand that _cluster/allocation/explain?pretty can provide insight into the reason for the UNASSIGNED state, my understanding was that shard-state changes are only triggered by node-state changes (joining/leaving, etc.) - which is not the case here. What other reasons could trigger changes in shard states?

  2. Checked if there are any abnormal logs on the master - found none. If you could point me to anything specific to look for on the master and/or other nodes, please let me know

  3. Are there any specific cluster/node stats to examine and correlate with this behaviour?

  4. More importantly, will replica shards of a few indices remaining in UNASSIGNED state impact queries on those indices? As I understand it, searches use Adaptive Replica Selection, so if some replicas are UNASSIGNED, I assume queries would hit the primary shards harder (or am I overstating this?).
    Please correct me if I am wrong. If my assumption is right, please let me know of any search metrics to correlate with this.

I can share the _cluster/allocation/explain?pretty output, but wanted to understand what reasons other than node joins/leaves can cause shard-state changes.

Thanks in advance

  • Dinesh

That's pretty old, long past EOL, so you should upgrade as a matter of urgency. I don't think I have a development environment for this version any more so detailed troubleshooting will be difficult.

I'm pretty sure all shard failures are logged, and the message will indicate the reason.
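Building on that: one way to narrow down the logs is to scan the master node's log for lines mentioning shard failures. The sample lines and filename below are invented for illustration (the exact message wording varies by version), so matching on "fail" plus "shard" is only a heuristic starting point:

```python
# Sketch: filter log lines that mention shard failures. In practice, read the
# lines from a copied log file (e.g. open("es-master.log") - name made up here).
sample_log = [
    "[2023-01-10T02:14:07] [INFO ] cluster health status changed from [GREEN] to [YELLOW]",
    "[2023-01-10T02:14:07] [WARN ] failing shard [logs-000042][3], reason [shard failure]",
    "[2023-01-10T02:15:31] [INFO ] recovery completed",
]

hits = [line for line in sample_log
        if "fail" in line.lower() and "shard" in line.lower()]
for line in hits:
    print(line)
```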
