Recently, for the first time, we hit a scenario where ES cluster health went yellow because the allocator was trying to place a replica shard on the same data node that already held its primary.
Summary:
ES Version: 5.6.4
Cluster has 14 nodes: 2 gateway nodes, 11 data nodes, and 1 utility node
New indexes were created at the same time (as always), but 2 of them had issues
After digging through with the APIs, we found the actual issue: for index logstash-cos-us-data shard 1, both the primary and the replica were being assigned to the same data node, which left the replica shard unassigned
Shard API
gmahadevan@:~$ curl -XGET ':9200/_cat/shards?v' | grep -v STARTED
index                              shard prirep state      docs store ip node
logstash-cos-<index 1>-2018.04.30  1     r      UNASSIGNED
logstash-cos-<index 2>-2018.04.30  3     r      UNASSIGNED
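A cleaner way to pull just the unassigned shards along with the reason the allocator recorded (a sketch; the host is a placeholder, -s only suppresses curl's progress meter, and unassigned.reason is one of the optional _cat/shards columns):
curl -s -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED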
Cluster allocation explain API showed:
gmahadevan@virtmanagedal0501:~$ curl -XGET 'xxx.xxx.xxx254:9200/_cluster/allocation/explain'
{"index":"logstash-cos-index 1-2018.04.30","shard":1,"primary":false,"current_state":"unassigned","unassigned_info":{"reason":"INDEX_CREATED","at":"2018-04-30T00:00:06.896Z","last_allocation_status":"no_attempt"},"can_allocate":"yes","allocate_explanation":"can allocate the shard","target_node":{"id":"UVtLoxfpSReOCnXiB3xJhA","name":"<data node ‘h’>-instance1","transport_address":"xxx.xxx.xxx148:"},"node_allocation_decisions":[{"node_id":"I6EnD8PUS_i48NIg-aAhLA","node_name":"<data node ‘b’>-instance1","transport_address":"xxx.xxx.xxx154:","node_decision":"yes","weight_ranking":2},{"node_id":"GjwYijkPSNKuLFSgNgqyXQ","node_name":"<data node ‘a’>-instance1","transport_address":"xxx.xxx.xxx155:","node_decision":"yes","weight_ranking":3},{"node_id":"IigZq5X2Sj6evdzjdPYQcg","node_name":"<data node ‘j’>-instance1","transport_address":"xxx.xxx.xxx225:","node_decision":"yes","weight_ranking":4},{"node_id":"oNEWO5EzQgSfdKtfhQenMQ","node_name":"<data node ‘i’>-instance1","transport_address":"xxx.xxx.xxx234:","node_decision":"yes","weight_ranking":5},{"node_id":"TlL0PRKFTLqQMBS2Pqi_JQ","node_name":"<data node ‘k’>-instance1","transport_address":"xxx.xxx.xxx240:","node_decision":"yes","weight_ranking":6},{"node_id":"UVtLoxfpSReOCnXiB3xJhA","node_name":"<data node ‘h’>-instance1","transport_address":"xxx.xxx.xxx148:","node_decision":"yes","weight_ranking":8},{"node_id":"bdPx0_bhQKifI9-va7y9Cg","node_name":"<data node ‘f’>-instance1","transport_address":"xxx.xxx.xxx149:","node_decision":"yes","weight_ranking":9},{"node_id":"WpnSr2egR0ycllsdfwGjBA","node_name":"<data node ‘c’>-instance1","transport_address":"xxx.xxx.xxx153:","node_decision":"yes","weight_ranking":10},{"node_id":"58vUt7GYTEO3y5E-YmD_gA","node_name":"<data node ‘d’>-instance1","transport_address":"xxx.xxx.xxx152:","node_decision":"yes","weight_ranking":11},{"node_id":"iPJ0Zg-fQpq8uYgo8PLKJA","node_name":"<data node ‘d’>-instance1","transport_address":"xxx.xxx.xxx151:","node_decision":"throttled","weight_ranking":1,"deciders":[{"decider":"throttling","decision":"THROTTLE","explanation":"reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"}]},{"node_id":"JbZrajUuReqUlEve0yz8Yw","node_name":"<data node ‘g’>-instance1","transport_address":"xxx.xxx.xxx150:","node_decision":"no","weight_ranking":7,"deciders":[{"decider":"same_shard","decision":"NO","explanation":"the shard cannot be allocated to the same node on which a copy of the shard already exists [[logstash-cos-index 1-2018.04.30][1], node[JbZrajUuReqUlEve0yz8Yw], [P], s[STARTED], a[id=I86hQ9s1TiqmPQi3yUdBmQ]]"}]}]}
We had to re-route the replica to a different data node to recover ES health (a sketch of the command is below). This is the first time in probably a year (we have been on the same ES version that whole time) that we have seen this behavior. Any thoughts on what could be causing this issue?
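The re-route we issued was along these lines; a minimal sketch using the cluster reroute API's allocate_replica command (host, index, and node name are placeholders for our redacted values):
curl -XPOST 'localhost:9200/_cluster/reroute?pretty' -H 'Content-Type: application/json' -d '
{
  "commands": [
    {
      "allocate_replica": {
        "index": "logstash-cos-<index 1>-2018.04.30",
        "shard": 1,
        "node": "<target data node>-instance1"
      }
    }
  ]
}'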