Even after setting disk-based thresholds, a node ran out of disk space, and because of that the cluster went yellow.
Just before the node ran out of disk space, a lot of relocation happened on this cluster and Elasticsearch moved many shards onto this node.
Are disk-based thresholds not respected during relocation?
The cluster will stop allocating new shards to a node once it hits the low watermark. So if the initial shard activity you saw pushed the node over the low watermark, then once those initial shards had moved, the next few it tried to allocate would be stuck unassigned because there was no room for them anywhere, and hence the cluster state goes yellow.
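For reference, here's a minimal sketch of setting the two watermarks dynamically through the cluster settings API, assuming a cluster reachable at localhost:9200 (the URL and percentages are placeholders, not your actual configuration):

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster endpoint

settings = {
    "transient": {
        # stop allocating *new* shards to a node once usage passes this
        "cluster.routing.allocation.disk.watermark.low": "85%",
        # start moving shards *off* a node once usage passes this
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
}

resp = requests.put(f"{ES}/_cluster/settings", json=settings)
resp.raise_for_status()
print(resp.json())
```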
That link specifies that relocations are taken into account when calculating disk space. So ideally, multiple relocations would only be assigned to a node if all of the shards could actually fit on it.
And in my case, even after the node crossed both the low and high watermarks, relocations continued and the actual free disk space dropped to 0%, after which all the shards on that node were marked as failed and the cluster went yellow.
The only other thing I can think of is that because disk usage is only polled every 30s by default, the relocations were triggered between successive polling events and just overwhelmed it. Very difficult to say without poring through the master log file.
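If it helps, a minimal sketch of inspecting and tightening that polling interval (cluster.info.update.interval) via the cluster settings API; the endpoint URL and the 10s value are assumptions:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster endpoint

# Show the effective interval, including defaults (30s if never changed)
current = requests.get(
    f"{ES}/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
).json()
print(current.get("defaults", {}).get("cluster.info.update.interval"))

# Poll disk usage more often so the allocator works from fresher numbers
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.info.update.interval": "10s"}},
).raise_for_status()
```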
Do you know what triggered the mass reallocation in the first place? Was it a node failure? I know one of the clusters I manage couldn't cope with a multiple-node failure because of the amount of data on each node, so I have set up forced shard awareness to help prevent your scenario. The cluster will still go yellow, but the replica shards won't move onto other nodes and potentially eat all their remaining disk space, which allows indexing to continue on the primary shards.
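Roughly, the forced awareness setup looks like the sketch below; the attribute name "zone" and the values are placeholders for whatever matches your topology:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster endpoint

# Each node also needs a matching attribute in elasticsearch.yml, e.g.:
#   node.attr.zone: zone_a
settings = {
    "persistent": {
        "cluster.routing.allocation.awareness.attributes": "zone",
        # With forced awareness, replicas belonging to a lost zone stay
        # unassigned (cluster goes yellow) instead of piling onto the
        # surviving nodes and eating their remaining disk space.
        "cluster.routing.allocation.awareness.force.zone.values": "zone_a,zone_b",
    }
}

requests.put(f"{ES}/_cluster/settings", json=settings).raise_for_status()
```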
That seems like a possibility. But the 30s polling is for checking whether a node has breached its limit and shards need to be moved off it. I am assuming that before an actual relocation command is issued, it is ensured that the node the shard is being sent to has enough free space, and that if it doesn't, the shard is not sent to that node. (There is no documentation for this case.)
In my case, relocations were triggered when multiple nodes reached their high watermark at around the same time. I had faced a similar situation in the past too; at that time we had lost one node.
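(As an aside, the per-node disk figures the allocator works from can be eyeballed with the _cat/allocation API; a minimal sketch, assuming a cluster at localhost:9200:)

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster endpoint

resp = requests.get(
    f"{ES}/_cat/allocation",
    params={"v": "true", "h": "node,shards,disk.used,disk.avail,disk.percent"},
)
print(resp.text)  # one line per data node with its current disk usage
```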
I don't know enough about the internal workings to be able to answer that one, but the answer would lie in the master log file. You'd be able to see when the node was pushed over the low and high watermarks, as an event is written to the log file at the time. All I do know from experience is that it won't stop the current relocation action unless the shard source is lost. The behaviour I've noticed is that once a shard relocation starts, it continues until it is finished, and only then are checks carried out to see if it can do another one.

I can't say with 100% certainty that this is how it behaves in the watermark scenario you have, but it behaves like this when altering cluster settings mid-relocation. For example, changing cluster.routing.allocation.node_concurrent_recoveries from 4 to 2 while 4 shards are currently moving won't suddenly abort two of the active relocations; instead it will wait until they have finished, then check how many are running and decide whether or not it can start another one.
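A small sketch of that kind of settings change, if useful; the endpoint and values are placeholders:

```python
import requests

ES = "http://localhost:9200"  # assumed local cluster endpoint

# Relocations/recoveries currently in flight
print(requests.get(f"{ES}/_cat/recovery",
                   params={"active_only": "true", "v": "true"}).text)

# Lower the limit; in-flight recoveries still run to completion,
# the new value only applies when the next recovery is scheduled
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.node_concurrent_recoveries": 2}},
).raise_for_status()
```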