Recently I am having trouble keeping an Elasticsearch cluster working, the cluster seems to fail somewhere during the night. All queries submitted to the cluster result in timeouts, so it seems nodes 'hang' in a some state not propagating requests anymore.
After a manual restart of a (single) node the cluster becomes working again. Logs are very 'scattered' and do not point me to a cause directly. I have consulted with my hosting provider and they indicate there have not been any changes in hardware recently and our cluster has been running on the same hardware for about half a year now.
This behaviour stared to show after we did a rolling upgrade to 6.3.0 (from 6.2.4), but not immediately afterwards (a couple of days later).
Our nodes (we have 3 of them, so fairly small cluster) occupy about 50% of the allocated heap and there is no sign of a memory overflow. This is also the case for storage, about 50% is used. Nodes idle around 20% CPU usage so again no indication of needing more hardware.
I am basically stuck here and would greatly appreciate some pointers in where to look/search to resolve this issue. I can supply logs if needed