We are having an issue with an issue with Ubuntu kernel causing the machine running elasticsearch to hung.
While the machine in this state (it only solve after restart) no API call to elasticsearch (or even SSH to the server) is returning result, though elasticsearch continues to identify the server as part of the cluster (because the way Zen discovery works). It means that the server not responding while the master does not remove it from the cluster.
There is no simple solution to this. Nodes that are completely working or completely broken are easy to identify, but in this case AIUI the faulty node is neither: it just repeatedly hangs for a while and then starts working again. Each time this happens it looks to the master node like a transient network issue. There is no "correct" way to deal with this to this, only heuristics, and Elasticsearch's heuristics are quite conservative in their detection of faulty nodes because removing a node and reallocating all its shards can be quite expensive.
Rolling back to a non-buggy kernel version is all I can suggest. I've had a quick look, but can't find details on when this bug was introduced, sorry. The discussion about this kernel issue all seems to be in the last month or so.