All shards being reassigned from one of our Warm nodes

Good day all.

A few days ago we noticed that all shards on one of our warm data nodes are being rebalanced and reassigned to the other warm nodes. We are currently at a loss as to the reason.

The node itself is not low on space. There is a mapping on the server (present on all nodes in the cluster) that points to the corporate NAS used for the snapshot repositories; that one is quite high on used space, but I fail to see how it would be the cause.

Size Used Avail Use%
47G 0 47G 0%
47G 0 47G 0%
47G 147M 47G 1%
47G 0 47G 0%
10G 3.9G 6.2G 39%
244M 158M 87M 65%
250M 9.9M 240M 4%
30T 601G 29T 3%
2.2T 634M 2.2T 1%
2.0G 34M 2.0G 2%
4.0G 579M 3.5G 15%
2.0G 50M 2.0G 3%
10G 2.0G 8.1G 20%
1014M 33M 982M 4%
105T 86T 20T 82% <----- REPO MAPPING
9.3G 0 9.3G 0%
9.3G 0 9.3G 0%
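
For anyone looking at the same symptoms, here is a rough Python sketch of how the disk side can be checked from Elasticsearch's point of view rather than the OS's: per-node disk usage as the allocator sees it, plus the effective disk watermark settings. It assumes the requests package is installed and the cluster answers on localhost:9200; adjust the endpoint for your environment.

```python
# Sketch only: assumes `requests` (pip install requests) and a cluster
# reachable at http://localhost:9200 -- adjust for your setup.
import requests

ES = "http://localhost:9200"

# Disk usage per node as Elasticsearch itself sees it (unlike `df -h`,
# this only covers the data paths, not every mount such as the NAS repo).
print(requests.get(f"{ES}/_cat/allocation?v").text)

# Effective disk-based allocation settings, including the defaults for
# cluster.routing.allocation.disk.watermark.low / high / flood_stage.
resp = requests.get(
    f"{ES}/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
)
for section, values in resp.json().items():   # persistent / transient / defaults
    for key, value in values.items():
        if key.startswith("cluster.routing.allocation.disk"):
            print(section, key, "=", value)
```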

We have tried forcing reallocation and saw some shards come back to the node, only to be reallocated to the other nodes immediately afterwards.
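
When a started shard keeps being moved off a node, the allocation explain API lists each decider's verdict, which usually names the reason directly. A minimal sketch (Python, assuming requests and localhost:9200; the index name and shard number are placeholders):

```python
# Sketch only: index name and shard number are placeholders -- point this at
# one of the shards that keeps leaving the node.
import requests

ES = "http://localhost:9200"

body = {
    "index": "my-warm-index-000001",  # placeholder
    "shard": 0,
    "primary": True,
}
# For an assigned shard the response explains whether it can remain on its
# current node and, if not, which decider says no.
resp = requests.get(f"{ES}/_cluster/allocation/explain", json=body)
print(resp.json())
```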

Only one node is currently showing this behaviour; all hot nodes are running with no issues. So far only one warm node is being emptied of its shards. We are also noticing that we always have 20 shards being moved TO the malfunctioning node, yet for some reason its shard count keeps going down.
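
Since only the one warm node is affected, it is also worth ruling out allocation filtering: a stale cluster-level exclude entry, or a node attribute (e.g. the hot/warm attribute) that changed on that node, would drain it in exactly this way. A quick sketch, same assumptions as above:

```python
# Sketch only: assumes `requests` and a cluster at http://localhost:9200.
import requests

ES = "http://localhost:9200"

# Node attributes -- in a hot/warm setup a missing or changed attribute on one
# node makes index-level allocation rules push its shards elsewhere.
print(requests.get(f"{ES}/_cat/nodeattrs?v").text)

# Cluster-level allocation filters -- an exclude._name / exclude._ip entry
# left over from earlier maintenance would also empty the node.
resp = requests.get(f"{ES}/_cluster/settings", params={"flat_settings": "true"})
print(resp.json())
```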

We are running ES v6.8.3.

Would anyone be kind enough to give us a few hints on what to look into? Many thanks for your time.

Since this looks like a disk watermark threshold issue, but we are not seeing any problems with disk space on the node, we have tried rebooting it.

Once the node rejoined the cluster its shard count went up quite rapidly, but after a while it began going down again.
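
To tell whether the count is genuinely trending down or just churning, the shards currently on the node and the active recoveries can be watched with something like this (Python sketch; the node name is a placeholder, and it assumes requests and localhost:9200):

```python
# Sketch only: NODE_NAME is a placeholder for the affected warm node's name.
import requests

ES = "http://localhost:9200"
NODE_NAME = "warm-node-03"  # placeholder

# Shards currently sitting on the node.
shards = requests.get(
    f"{ES}/_cat/shards", params={"h": "index,shard,prirep,state,node"}
).text
on_node = [line for line in shards.splitlines() if NODE_NAME in line]
print(f"{len(on_node)} shards on {NODE_NAME}")

# Shard copies actively being moved or rebuilt right now.
print(requests.get(
    f"{ES}/_cat/recovery", params={"active_only": "true", "v": "true"}
).text)
```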

We are noticing a lot of these messages in the node's log:

failed to execute global checkpoint sync
org.elasticsearch.index.shard.ShardNotInPrimaryModeException: CurrentState[STARTED] shard is not in primary mode
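
As far as we can tell, this message typically shows up when a background global checkpoint sync runs on a shard copy that has just handed over (or lost) its primary role, e.g. while primaries are relocating, so it tends to coincide with heavy shard movement rather than being the root cause. A quick way to see whether that matches what the cluster is doing (Python sketch, same assumptions as above):

```python
# Sketch only: assumes `requests` and a cluster at http://localhost:9200.
import requests

ES = "http://localhost:9200"

# Relocating / initializing / unassigned counts give a quick read on whether
# the "shard is not in primary mode" messages line up with ongoing hand-overs.
health = requests.get(f"{ES}/_cluster/health").json()
print({k: health[k] for k in (
    "status",
    "relocating_shards",
    "initializing_shards",
    "unassigned_shards",
)})
```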

Well, we let things run for a while and the shard count started going back up again and never went down. It seems a reboot was needed after all. What's left to do now is figure out what caused the initial problem.

We'd need to see logs from the node in question, I think :slight_smile:
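
If the regular logs don't show which decider is draining the node, allocation logging can be raised temporarily so the reasoning is written to the elected master's log. A sketch (Python, assuming requests and localhost:9200); it is verbose, so remember to reset it afterwards:

```python
# Sketch only: assumes `requests` and a cluster at http://localhost:9200.
import requests

ES = "http://localhost:9200"

# Temporarily raise allocation logging so the deciders' reasoning for moving
# shards off the node shows up in the elected master's log.
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"logger.org.elasticsearch.cluster.routing.allocation": "DEBUG"}},
)

# ...run until the behaviour reproduces, then reset the logger to its default.
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"logger.org.elasticsearch.cluster.routing.allocation": None}},
)
```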
