I have a problem with our Elasticsearch 5.4.2 cluster.
We have 52 nodes, of which 40 are data nodes. On our non-master data nodes, tasks are accumulating and getting stuck. A full cluster restart did not solve the problem; the accumulating tasks came back quickly. If we restart a node on which we have the problem, it just moves to other nodes like a disease. A rough sketch of how we count these stuck tasks per node is below.
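For context, this is roughly how we pull the per-node counts from the task management API (the host and the "stuck" threshold are placeholders, not our real setup):

```python
from collections import Counter

import requests

ES_URL = "http://localhost:9200"  # placeholder, not our real coordinating node

def stuck_tasks_per_node(min_running_seconds=300):
    """Count long-running tasks per node using GET _tasks?detailed=true."""
    resp = requests.get(ES_URL + "/_tasks", params={"detailed": "true"}, timeout=30)
    resp.raise_for_status()
    counts = Counter()
    for node_id, node in resp.json()["nodes"].items():
        for task in node["tasks"].values():
            # running_time_in_nanos is reported for every task by the API
            if task["running_time_in_nanos"] >= min_running_seconds * 1e9:
                counts[node.get("name", node_id)] += 1
    return counts

if __name__ == "__main__":
    for node_name, count in stuck_tasks_per_node().most_common():
        print(node_name, count)
```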
The detailed task list shows entries like this one:
The cluster was completely restarted ~7 hours ago, so there are not too many hung tasks yet (~17k at the moment), but the application hangs after a couple of minutes with ES read timeouts, and nothing shows up in Elasticsearch's logs. The application itself has not changed.
BTW, for search and other operations the task API shows which index is involved, but for get operations it doesn't. Is that intentional?
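For what it's worth, this is roughly how we compared the task descriptions by action type (host is a placeholder); on our cluster the description names the indices for search actions but comes back empty for get actions:

```python
import requests

ES_URL = "http://localhost:9200"  # placeholder

resp = requests.get(ES_URL + "/_tasks", params={"detailed": "true"}, timeout=30)
resp.raise_for_status()
for node in resp.json()["nodes"].values():
    for task in node["tasks"].values():
        if task["action"].startswith(("indices:data/read/search",
                                      "indices:data/read/get")):
            # On our cluster, search tasks carry the index in "description",
            # while get tasks do not.
            print(task["action"], "->", repr(task.get("description", "")))
```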
It might be relevant which index is causing the problem (if it can be narrowed down to one).
After a truly complete restart (master nodes included), the accumulation of tasks seems to have stopped. Still, I would be a lot calmer if I knew what caused this, because we still don't know what happened or how it could be solved (apart from taking every possible load off Elasticsearch and doing several cluster restarts).
Please let me know if we can provide more info to help sort this out.