Shards randomly going into "initializing"/"unassigned" state even when the cluster is in a stable state

Description : Even without any node disturbances (nodes leaving/joining), some shards randomly go into INITIALIZING or UNASSIGNED state over a 24-48 hour window.

Environment :
ES version : 7.0.1
Number of nodes : 12 (no dedicated roles - all nodes assume all roles)
Are indices static ? : 50% static indices, 50% rollover indices, with rollover periods starting from a 2-hour minimum
Deployment : All 12 ES nodes run as Kubernetes pods
Metrics collection : We use Telegraf to collect metrics and store them in InfluxDB for ES-specific metrics tracking

Index-Specific Configurations :

  1. 10 shards per index with 1 replica each

Observations so far :

  • Heap, disk, and RAM usage on the instances does not exceed 60% (from node stats)
  • No abnormal master elections seen
  • Kubernetes liveness probes do not fail (so we can rule out a node failing to respond to Kubernetes and being restarted by it)
  • No nodal restarts
  • No abnormal load on the master node
  • Checked whether rollover of indices (destroying old indices and creating new ones) causes shard state changes. Not all rollover indices go into YELLOW state - fewer than 5% of them do. Some remain YELLOW after exhausting all allocation attempts, but even this pattern is inconsistent: in a few instances they recover cleanly and return to GREEN state over time
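To spot these shards as they appear, the _cat/shards API can be filtered down to anything not STARTED. The snippet below is a minimal sketch that filters a hand-written sample of what _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason might return - the sample rows and the index name are made up for illustration, not real cluster output:

```python
import json

# Made-up sample of _cat/shards JSON output; in practice you would fetch
# GET /_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason
sample = json.loads("""
[
  {"index": "logs-000042", "shard": "3", "prirep": "r", "state": "UNASSIGNED",
   "unassigned.reason": "ALLOCATION_FAILED"},
  {"index": "logs-000042", "shard": "3", "prirep": "p", "state": "STARTED",
   "unassigned.reason": null}
]
""")

# Keep only shards that are not fully started (INITIALIZING, UNASSIGNED, ...)
problem = [s for s in sample if s["state"] != "STARTED"]
for s in problem:
    print(s["index"], s["shard"], s["prirep"], s["state"], s["unassigned.reason"])
```

Run periodically (e.g. from a cron job feeding Telegraf), this gives a timeline of exactly which shards flap and with what unassigned reason.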

Clarifications :

  1. While I understand that _cluster/allocation/explain?pretty can provide insight into the reason for the UNASSIGNED state, my understanding was that shard-state changes are only triggered by node-state changes (joining/leaving, etc.) - which is not the case here. What other reasons could trigger changes in shard states?

  2. Checked if there are any abnormal logs on the master - found none. If you could point me to anything specific to look for on the master and/or other nodes, please let me know

  3. Are there any specific cluster/node stats to examine and correlate with this behaviour?

  4. More importantly, will replica shards of a few indices remaining in UNASSIGNED state impact queries on those indices? As I understand it, searches use Adaptive Replica Selection, so if some replicas are UNASSIGNED, I assume queries would hit the primary shards harder (or am I overstating this?).
    Please correct me if I am wrong. If my assumption is right, please let me know of any search metrics to correlate with this.

I can share the _cluster/allocation/explain?pretty output, but wanted to understand what reasons other than node joins/leaves can cause shard-state changes.

Thanks in advance

  • Dinesh

That's pretty old, long past EOL, so you should upgrade as a matter of urgency. I don't think I have a development environment for this version any more so detailed troubleshooting will be difficult.

I'm pretty sure all shard failures are logged, and the message will indicate the reason.
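Building on that: one way to narrow down the logs is to scan the master node's log for lines mentioning shard failures. The sample lines and filename below are invented for illustration (the exact message wording varies by version), so matching on "fail" plus "shard" is only a heuristic starting point:

```python
# Sketch: filter log lines that mention shard failures. In practice, read the
# lines from a copied log file (e.g. open("es-master.log") - name made up here).
sample_log = [
    "[2023-01-10T02:14:07] [INFO ] cluster health status changed from [GREEN] to [YELLOW]",
    "[2023-01-10T02:14:07] [WARN ] failing shard [logs-000042][3], reason [shard failure]",
    "[2023-01-10T02:15:31] [INFO ] recovery completed",
]

hits = [line for line in sample_log
        if "fail" in line.lower() and "shard" in line.lower()]
for line in hits:
    print(line)
```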
