I have a problem with our Elasticsearch 5.4.2 cluster.
We have 52 nodes, of which 40 are data nodes. On our non-master data nodes, tasks accumulate and get stuck. A full cluster restart did not solve the problem; the accumulating tasks came back quickly. If we restart a node that shows the problem, it just moves to other nodes, like a disease.
The detailed task list shows entries like this one:
The cluster was fully restarted about 7 hours ago, so there are not too many hung tasks yet (~17k at the moment), but the application hangs after a couple of minutes with ES read timeouts, and nothing shows up in the ES logs. The application itself has not changed.
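In case it helps anyone reproduce the numbers, here is a minimal sketch of how long-running tasks could be tallied per node and per action from the `_tasks?detailed=true` response. The address `http://localhost:9200`, the 60-second threshold, the helper names, and the sample data are assumptions for illustration, not values from our cluster.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_tasks(base_url="http://localhost:9200"):
    """Fetch the detailed task list. base_url is an assumed address."""
    with urlopen(base_url + "/_tasks?detailed=true") as resp:
        return json.load(resp)


def summarize_tasks(response, min_seconds=60):
    """Count tasks running for at least min_seconds, per node name and per action."""
    per_node = Counter()
    per_action = Counter()
    for node in response.get("nodes", {}).values():
        for task in node.get("tasks", {}).values():
            running_s = task.get("running_time_in_nanos", 0) / 1e9
            if running_s >= min_seconds:
                per_node[node.get("name", "?")] += 1
                per_action[task.get("action", "?")] += 1
    return per_node, per_action


# Hypothetical sample shaped like a _tasks response (not real cluster data).
sample = {
    "nodes": {
        "n1": {
            "name": "data-1",
            "tasks": {
                "n1:1": {"action": "indices:data/read/search",
                         "running_time_in_nanos": 120_000_000_000},
                "n1:2": {"action": "indices:data/read/search",
                         "running_time_in_nanos": 120_000_000_000},
                "n1:3": {"action": "cluster:monitor/tasks/lists",
                         "running_time_in_nanos": 1_000_000_000},
            },
        },
        "n2": {
            "name": "data-2",
            "tasks": {
                "n2:1": {"action": "indices:data/write/bulk",
                         "running_time_in_nanos": 90_000_000_000},
            },
        },
    }
}

per_node, per_action = summarize_tasks(sample)
```

Against a live cluster, `summarize_tasks(fetch_tasks())` would give the same kind of breakdown, which should make it easier to see whether the stuck tasks cluster around one action type or one node.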
We still don't know what happened or how this could be solved (apart from taking every possible load off Elasticsearch and restarting the cluster several times).
Please let me know if we can provide more information to help sort this out.