Description : Even without any nodal disturbances (nodes leaving/joining), some shards randomly go into INITIALIZING or UNASSIGNED state over a 24-48 hour period.
Environment :
ES Version : 7.0.1
No of nodes : 12 (No specific roles - all nodes assume all roles)
Are indices static ? : 50% static indices, 50% rollover indices with rollover period starting from 2 hrs minimum
Deployment : All 12 ES nodes run as Kubernetes PODs
Metrics Collection : We use telegraf to collect metrics and store the same in InfluxDB for ES-specific metrics tracking
Index-Specific Configurations :
- 10 shards per index with 1 replica each
Observations so far :
- Heap, disk and RAM usage of the instances do not exceed 60% (from node stats)
- No abnormal master elections seen
- Kubernetes liveness probes do not fail (hence we can eliminate the possibility of a node not responding to Kubernetes and being restarted by it)
- No nodal restarts
- No abnormal load on master node
- Checked whether rollover of indices (deleting old indices and creating new ones) causes state changes in the shards. Not all rollover indices go into YELLOW state; fewer than 5% of the rollover indices do. Sometimes they remain YELLOW after exhausting all allocation attempts, but even this pattern is not consistent: in a few instances they recover cleanly and move to GREEN over time
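To track these state changes over time, the non-STARTED shards could be tallied by their recorded reason. The following is only a minimal sketch: the `localhost:9200` address is an assumption, and it relies on the `unassigned.reason` column of `_cat/shards` being requested explicitly.

```python
import json
from collections import Counter
from urllib.request import urlopen

# Assumed address; replace with the real cluster endpoint.
CAT_SHARDS_URL = (
    "http://localhost:9200/_cat/shards"
    "?format=json&h=index,shard,prirep,state,unassigned.reason"
)


def summarize_shards(rows):
    """Count shards that are not STARTED, grouped by (state, reason)."""
    counts = Counter()
    for row in rows:
        state = row.get("state")
        if state != "STARTED":
            counts[(state, row.get("unassigned.reason"))] += 1
    return counts


def fetch_rows(url=CAT_SHARDS_URL):
    """Fetch the shard table as a list of dicts (one per shard copy)."""
    with urlopen(url) as resp:
        return json.load(resp)


if __name__ == "__main__":
    for (state, reason), n in summarize_shards(fetch_rows()).most_common():
        print(state, reason, n)
```

Running this on a schedule and storing the counts next to the existing telegraf metrics would show whether the state changes line up with rollover times.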
Clarifications :
- While I understand that _cluster/allocation/explain?pretty can provide insight into the reason for the UNASSIGNED state, my understanding was that any shard-state change is only triggered by nodal-state changes (joining / leaving etc.), which is not the case here. Hence, what other reasons could trigger changes in shard states?
- Checked if there are any abnormal logs on master and found none. If you could point to anything specific to look into on the master and/or other nodes, please let me know
- Are there any specific cluster / nodal stats to examine to correlate with this behaviour?
- More importantly, will replica shards for a few indices remaining in the UNASSIGNED state impact queries on those indices? As I understand it, searches use Adaptive Replica Selection. If some replicas are UNASSIGNED, I assume queries would hit the primary shards harder (or am I exaggerating?). Please correct me if I am wrong. If my assumption is right, please let me know which search metrics to correlate with.
I can share the _cluster/allocation/explain?pretty output, but I wanted to understand what reasons other than nodal joins/leaves can cause changes in shard states.
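For reference, this is roughly how the explain output for one shard copy could be requested and condensed. It is a sketch under assumptions: the `localhost:9200` address and the index name `my-index-000001` are placeholders, and the summary only picks out a few fields of the response (`unassigned_info`, `can_allocate`, and the per-node `deciders`).

```python
import json
from urllib.request import Request, urlopen

# Assumed address; replace with the real cluster endpoint.
EXPLAIN_URL = "http://localhost:9200/_cluster/allocation/explain"


def explain_shard(index, shard, primary=False, url=EXPLAIN_URL):
    """Ask the cluster why a specific shard copy is (un)assigned."""
    body = json.dumps({"index": index, "shard": shard, "primary": primary})
    req = Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urlopen(req) as resp:
        return json.load(resp)


def summarize_explanation(explanation):
    """Reduce an explain response to the reason and the deciders that said NO."""
    info = explanation.get("unassigned_info", {})
    blocking = []
    for node in explanation.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blocking.append((node.get("node_name"), decider.get("decider")))
    return {
        "reason": info.get("reason"),
        "last_allocation_status": info.get("last_allocation_status"),
        "can_allocate": explanation.get("can_allocate"),
        "blocking_deciders": blocking,
    }


if __name__ == "__main__":
    # "my-index-000001" is a hypothetical index name used for illustration.
    result = explain_shard("my-index-000001", 0, primary=False)
    print(json.dumps(summarize_explanation(result), indent=2))
```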
Thanks in advance
- Dinesh