We are running ES 5.6.4 on Java 1.8.0_171.
Lately we have observed that the Linux process gets stuck in a zombie state on one VM (not always the same one). When this happens, the ES cluster doesn't throw any errors but doesn't process requests at all: everything hangs, and on the client side we receive a lot of timeout exceptions from the whole cluster. The only solution is to reboot the Linux OS on the node where the process is in the zombie state. After that, the cluster recovers.
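For anyone who wants to confirm the same symptom, here is a minimal sketch of how a process state can be read on Linux. It assumes the standard `/proc/<pid>/stat` layout, where the state letter is the first field after the parenthesised command name (`Z` means zombie). The function names `parse_state` and `process_state` are made up for this illustration and are not part of any ES tooling.

```python
from pathlib import Path


def parse_state(stat_line: str) -> str:
    """Return the one-letter state from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST ')' rather than on whitespace.
    fields_after_comm = stat_line[stat_line.rindex(")") + 1:].split()
    return fields_after_comm[0]


def process_state(pid: int) -> str:
    """Read the state of a live process (Linux only)."""
    return parse_state(Path(f"/proc/{pid}/stat").read_text())


def is_zombie(pid: int) -> bool:
    # 'Z' is the state letter the kernel reports for a zombie process.
    return process_state(pid) == "Z"
```

Calling `is_zombie(pid)` with the PID of the ES process on the affected VM should return `True` while the node is in the state described above.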
The issue seems to be on the Linux side, but the main question here is:
Why does the whole cluster go down when only one instance (a data-only node) has issues?