Unable to allocate replica shard after node restart

I have an Elasticsearch (v5.6.10) cluster with 3 nodes.

  • Node A : Master
  • Node B : Master + Data
  • Node C : Master + Data

There are 6 shards per data node with replication set to 1. All 6 primary shards are on Node B and all 6 replicas are on Node C.

Then the Node B server stopped unexpectedly, and eventually all the shards on Node C got promoted to primary.

But once Node B came back up, the replica shards are not getting assigned to it. As a result my cluster status is now "yellow", with 6 unassigned shards.

Also, on checking the logs on the master node, I found the errors below:

[2020-08-04T10:55:05,319][INFO ][o.e.c.r.a.DiskThresholdMonitor] [Bd5E45c] low disk watermark [85%] exceeded on [UBCxWls9R0-WOBhEHMcT5A][UBCxWls][/opt/app/elasticsearch-5.6.10/data/nodes/0] free: 56gb[13.9%], replicas will not be assigned to this node
[2020-08-04T10:55:05,319][INFO ][o.e.c.r.a.DiskThresholdMonitor] [Bd5E45c] low disk watermark [85%] exceeded on [zYg-V-AhSx6J8HyCSsiO5g][zYg-V-A][/opt/app/elasticsearch/data/nodes/0] free: 47.7gb[12.1%], replicas will not be assigned to this node

UBCxWls is Node B
zYg-V-A is Node C

Please help me understand what options I have at this stage.

@DavidTurner You helped me understand https://discuss.elastic.co/t/stop-start-an-elasticsearch-instance-having-all-the-primary-shards/220028. This situation is similar to that one, except that this node shutdown was accidental.

This tells us your data nodes are getting full: you have exceeded the low disk watermark threshold, which by default is set at 85% of the storage volume. In your case, node "UBCxWls" has used 86.1% of its volume (13.9% remains free) and node "zYg-V-A" has used 87.9% (12.1% is free).

When a data node crosses the low disk watermark, Elasticsearch stops allocating new shards to that node. This is why those replicas can't be assigned to any node: both of your data nodes are too full. See the official documentation here.
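You can confirm this yourself by checking per-node disk usage and asking the cluster why a shard is unassigned. A minimal sketch using curl against a local node (the host/port is a placeholder, adjust to your setup):

    # Disk usage and shard counts per data node
    curl -s 'localhost:9200/_cat/allocation?v'

    # Health summary, including the number of unassigned shards
    curl -s 'localhost:9200/_cluster/health?pretty'

    # Ask the allocation decider why a shard is not being assigned
    # (with no request body it explains the first unassigned shard it finds)
    curl -s 'localhost:9200/_cluster/allocation/explain?pretty'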

The best way to solve this is to add a third data node: either turn the current dedicated master into a master + data node, or bring in a brand new data node. Your cluster will then have more overall storage space and can spread the shards out, lowering the disk usage on the two existing data nodes.
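If you go the route of turning the dedicated master into a master + data node, that is a change to elasticsearch.yml on Node A followed by a restart of that node. A rough sketch; the install path is only an assumption based on the paths in your logs, and you should first check that these keys are not already set elsewhere in the file:

    # Assumed install path; adjust to wherever Node A is installed.
    cat >> /opt/app/elasticsearch-5.6.10/config/elasticsearch.yml <<'EOF'
    node.master: true
    node.data: true
    EOF
    # Then restart the Elasticsearch process on Node A so the new role takes effect.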

Good luck!


Thank you for the help.

If I can increase the disk space on the data nodes, will that work? Do I need to restart the Elasticsearch nodes for that?

Also, once everything is fine, the replica shards will get allocated to Node B. Does that mean Elasticsearch will overwrite the existing local copies on that node (from before the crash) with the latest ones?

I'm not sure, but I would give it a try. If Elasticsearch doesn't detect the increased storage volume dynamically, you will have to restart each data node after the increase.
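One way to check whether Elasticsearch has picked up the extra space without restarting anything is to look at the filesystem stats it reports (host/port is a placeholder):

    # Total and available bytes as seen by each node
    curl -s 'localhost:9200/_nodes/stats/fs?pretty'

    # The same information as a compact per-node table
    curl -s 'localhost:9200/_cat/allocation?v'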

Once the data nodes have more storage space, the master will first assign the shards to their respective nodes, and then the replica shards will be synchronized with (updated from) the primary shards. So as long as all the primary shards are OK, your replicas will be updated too and your cluster should return to green.
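You can watch this happen while it runs, for example (host/port is a placeholder):

    # Shard recoveries, including replica syncs from the primaries
    curl -s 'localhost:9200/_cat/recovery?v'

    # Shard-level view: each replica should move from UNASSIGNED to STARTED
    curl -s 'localhost:9200/_cat/shards?v'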


Elasticsearch should find the new space easily, since free space changes all the time. You can also temporarily raise the low watermark to, say, 90% to see if it will allocate. Once it does and builds a new replica shard, it deletes the old local copy on a per-index or per-shard basis (I forget which). We just need it to do this fast enough to not hit the watermark again.
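Raising the low watermark temporarily is a transient cluster setting; something along these lines (the 90% value is just the example from above, host/port is a placeholder):

    # Temporarily raise the low disk watermark so replicas can be allocated again
    curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "cluster.routing.allocation.disk.watermark.low": "90%"
      }
    }'

    # Later, reset it back to the default by setting it to null
    curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "cluster.routing.allocation.disk.watermark.low": null
      }
    }'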

Adding space is better anyway, as you will likely run out soon; plus just making your master node a data node will give you a nicer, more balanced cluster.


Thank you @Bernt_Rostad & @Steve_Mushero

Since Elasticsearch deletes the old local shard copy only after it adds the replica, there is a moment when it needs almost double the storage. So, is there any way to delete these old shards beforehand?

Also, you mentioned:

We just need it to do this fast enough to not hit the watermark again.

How can I make it fast?

And yes, it needs double the storage, but only for that shard, and in theory it does this on a rolling basis; you can also throttle it (see below).

For speed, I mean 'fast' in the sense that it does the shard replacement fast enough to not go over the watermark. You can imagine a case where it starts 20 shard copies at once: that would be slow and run out of space, whereas if it did one at a time (the throttling), it would be fast and roll through them.

The easiest option is still to add lots more space, as you'll need it anyway as you grow.

Otherwise, try setting the watermark higher and see what happens. The default per-node throttle on incoming shards seems to be two at a time, so that may be good enough; you can also set "cluster.routing.allocation.node_concurrent_incoming_recoveries" to 1 if you want and then change the watermark, which should make it do only one at a time.
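That throttle is another transient cluster setting; roughly like this, combined with the watermark change shown earlier (host/port is a placeholder):

    # Limit each node to one incoming shard recovery at a time
    curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "cluster.routing.allocation.node_concurrent_incoming_recoveries": 1
      }
    }'

    # Reset it to the default (2) once the cluster is green again
    curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "cluster.routing.allocation.node_concurrent_incoming_recoveries": null
      }
    }'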


Thank you @Steve_Mushero. It worked. There is no way to mark two answers as the solution. Sorry about that.


Thank you @Bernt_Rostad.
