Elasticsearch 5.6.4 allocated primary and replica shard to the same data node, causing an ES health check issue

Recently, for the first time, we hit a scenario where ES cluster health was yellow because shards were allocated to the same data node.

Summary:
ES Version: 5.6.4
The cluster has 14 nodes: 2 gw nodes, 11 data nodes, and 1 utility node
New indexes were created at the same time (as always), but 2 indexes had issues
After digging through the APIs, we found the actual issue: for index logstash-cos-us-data, shard 1, the primary and replica were being assigned to the same data node, leaving the replica unassigned

Shard API

gmahadevan@:~$ curl -XGET ':9200/_cat/shards?v' | grep -v STARTED
index                               shard prirep state      docs store ip node
logstash-cos-<index 1>-2018.04.30   1     r      UNASSIGNED
logstash-cos-<index 2>-2018.04.30   3     r      UNASSIGNED

Cluster allocation API showed

gmahadevan@virtmanagedal0501:~$ curl -XGET 'xxx.xxx.xxx254:9200/_cluster/allocation/explain'
{"index":"logstash-cos-index 1-2018.04.30","shard":1,"primary":false,"current_state":"unassigned","unassigned_info":{"reason":"INDEX_CREATED","at":"2018-04-30T00:00:06.896Z","last_allocation_status":"no_attempt"},"can_allocate":"yes","allocate_explanation":"can allocate the shard","target_node":{"id":"UVtLoxfpSReOCnXiB3xJhA","name":"<data node ‘h’>-instance1","transport_address":"xxx.xxx.xxx148:"},"node_allocation_decisions":[{"node_id":"I6EnD8PUS_i48NIg-aAhLA","node_name":"<data node ‘b’>-instance1","transport_address":"xxx.xxx.xxx154:","node_decision":"yes","weight_ranking":2},{"node_id":"GjwYijkPSNKuLFSgNgqyXQ","node_name":"<data node ‘a’>-instance1","transport_address":"xxx.xxx.xxx155:","node_decision":"yes","weight_ranking":3},{"node_id":"IigZq5X2Sj6evdzjdPYQcg","node_name":"<data node ‘j’>-instance1","transport_address":"xxx.xxx.xxx225:","node_decision":"yes","weight_ranking":4},{"node_id":"oNEWO5EzQgSfdKtfhQenMQ","node_name":"<data node ‘i’>-instance1","transport_address":"xxx.xxx.xxx234:","node_decision":"yes","weight_ranking":5},{"node_id":"TlL0PRKFTLqQMBS2Pqi_JQ","node_name":"<data node ‘k’>-instance1","transport_address":"xxx.xxx.xxx240:","node_decision":"yes","weight_ranking":6},{"node_id":"UVtLoxfpSReOCnXiB3xJhA","node_name":"<data node ‘h’>-instance1","transport_address":"xxx.xxx.xxx148:","node_decision":"yes","weight_ranking":8},{"node_id":"bdPx0_bhQKifI9-va7y9Cg","node_name":"<data node ‘f’>-instance1","transport_address":"xxx.xxx.xxx149:","node_decision":"yes","weight_ranking":9},{"node_id":"WpnSr2egR0ycllsdfwGjBA","node_name":"<data node ‘c’>-instance1","transport_address":"xxx.xxx.xxx153:","node_decision":"yes","weight_ranking":10},{"node_id":"58vUt7GYTEO3y5E-YmD_gA","node_name":"<data node ‘d’>-instance1","transport_address":"xxx.xxx.xxx152:","node_decision":"yes","weight_ranking":11},{"node_id":"iPJ0Zg-fQpq8uYgo8PLKJA","node_name":"<data node ‘d’>-instance1","transport_address":"xxx.xxx.xxx151:","node_decision":"throttled","weight_ranking":1,"deciders":[{"decider":"throttling","decision":"THROTTLE","explanation":"reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"}]},{"node_id":"JbZrajUuReqUlEve0yz8Yw","node_name":"<data node ‘g’>-instance1","transport_address":"xxx.xxx.xxx150:","node_decision":"no","weight_ranking":7,"deciders":[{"decider":"same_shard","decision":"NO","explanation":"the shard cannot be allocated to the same node on which a copy of the shard already exists [[logstash-cos-index 1-2018.04.30][1], node[JbZrajUuReqUlEve0yz8Yw], [P], s[STARTED], a[id=I86hQ9s1TiqmPQi3yUdBmQ]]"}]}]}

We had to re-route the replica to a different data node to recover ES health, but this is the first time in probably a year (using the same ES version the whole time) that we have seen this behavior. Any thoughts on what could be causing this issue?
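For reference, re-allocating an unassigned replica in 5.x is done through the cluster reroute API; a minimal sketch of that call (the index, shard and node values below are placeholders, not the actual names):

curl -XPOST 'xxx.xxx.xxx254:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands" : [
    {
      "allocate_replica" : {
        "index" : "logstash-cos-<index 1>-2018.04.30",
        "shard" : 1,
        "node" : "<target data node>-instance1"
      }
    }
  ]
}'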

The message you highlighted says that the replica shard cannot be allocated to the same node on which a copy of the shard (the primary) already exists. That said, there are other nodes it can go to (see the YES decisions). The allocation is throttled because there are already two recoveries of initializing shards ongoing on the node that was chosen. Once those recoveries are completed, the replica should be assigned.
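If you want to watch those recoveries drain, something along these lines works (the host is a placeholder; the grep simply hides recoveries that have already finished):

curl -XGET 'xxx.xxx.xxx254:9200/_cat/recovery?v' | grep -v done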

If you need more info, please re-format the json using the preformatted text option. That will make it much easier to read.

Thanks Bleskes. On a similar note, we are seeing an issue where shards get stuck in the initializing state during shard routing because ES allocates 2 replica shards to the same data node simultaneously, and we see errors like:

"node_id" : "",
"node_name" : "",
"transport_address" : "1IP:port",
"node_decision" : "throttled",
"weight_ranking" : 1,
"deciders" : [
{
"decider" : "throttling",
"decision" : "THROTTLE",
"explanation" : "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_inco
ming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
}

Do you suggest increasing the count from 2 to a higher number? Is there an easy API call to do it, and are there any adverse side effects if we increase it to a bigger number? The shards just get stuck in the initializing state forever, so we use the cancel API to cancel the initialization so that one of them gets assigned to a different data node, but it takes a long time to get from "INITIALIZING REROUTE_CANCELLED" to "STARTED". Please advise.
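For reference, the cancel we issue goes through the same reroute API, roughly like this (the index, shard and node values are placeholders):

curl -XPOST 'xxx.xxx.xxx254:9200/_cluster/reroute' -H 'Content-Type: application/json' -d '
{
  "commands" : [
    {
      "cancel" : {
        "index" : "logstash-cos-<index 2>-2018.04.30",
        "shard" : 3,
        "node" : "<data node>-instance1"
      }
    }
  ]
}'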

These are not errors but rather an indication that the allocation is throttled and delayed until existing recoveries have completed. This should not take that long, depending on the amount of data. You can increase the value from 2, but there is a risk, as the limit is meant to protect the node from being overloaded with too many recoveries.
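It is a dynamic cluster setting, so if you do decide to raise it you can change it via the cluster settings API; for example (the value 4 is only an illustration, and the host is a placeholder):

curl -XPUT 'xxx.xxx.xxx254:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient" : {
    "cluster.routing.allocation.node_concurrent_recoveries" : 4
  }
}'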

When you say it takes a long time to get to STARTED, how long are we talking about?

Potentially an hour or two for a shard to get started. We are also in the process of upgrading our cluster from 5.6.4 to 5.6.8 on a node-by-node basis, so some data nodes are at 5.6.8 and others at 5.6.4. I understand we cannot have a primary shard on a 5.6.8 data node with its replica on a 5.6.4 data node, but shard recovery and assignment take even longer during upgrades. We are planning to run multiple ES instances per node at a later point, and I am wondering whether things will get worse when we do later upgrades. Any insight/advice would be greatly helpful. Please let me know if you would like specific API call outputs or more info about our cluster.

Are you following this guide? Rolling upgrade | Elasticsearch Guide [8.11] | Elastic. Doing so would mean recovery should take a few seconds, if not less.
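The core of that procedure for a 5.x cluster is roughly the following (host is a placeholder): disable allocation and do a synced flush before stopping each node, then re-enable allocation once the upgraded node has rejoined:

curl -XPUT 'xxx.xxx.xxx254:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "transient" : { "cluster.routing.allocation.enable" : "none" } }'
curl -XPOST 'xxx.xxx.xxx254:9200/_flush/synced'
# stop the node, upgrade it, restart it, wait for it to rejoin the cluster, then:
curl -XPUT 'xxx.xxx.xxx254:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "transient" : { "cluster.routing.allocation.enable" : "all" } }'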

multi-ES instance per node

What's your main motivation for this?
