We are having an issue with an issue with Ubuntu kernel causing the machine running elasticsearch to hung.
While the machine in this state (it only solve after restart) no API call to elasticsearch (or even SSH to the server) is returning result, though elasticsearch continues to identify the server as part of the cluster (because the way Zen discovery works). It means that the server not responding while the master does not remove it from the cluster.
@dudidu
How many nodes in this cluster?
Can you share the elasticsearch.yml
It would be interesting to see what the discovery.zen.minimum_master_nodes is set to.
There is no simple solution to this. Nodes that are completely working or completely broken are easy to identify, but in this case AIUI the faulty node is neither: it just repeatedly hangs for a while and then starts working again. Each time this happens it looks to the master node like a transient network issue. There is no "correct" way to deal with this to this, only heuristics, and Elasticsearch's heuristics are quite conservative in their detection of faulty nodes because removing a node and reallocating all its shards can be quite expensive.
It seems that there is a known issue with ubuntu kernel causing the machine to hung, which currently does not have a fix (A kernel fix should be released in the next couple of weeks).
Rolling back to a non-buggy kernel version is all I can suggest. I've had a quick look, but can't find details on when this bug was introduced, sorry. The discussion about this kernel issue all seems to be in the last month or so.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.