Get tasks hang and accumulate on 5.4.2


I have a problem with our Elasticsearch 5.4.2 cluster.
We have 52 nodes, of which 40 are data nodes. On our non-master data nodes, tasks are accumulating and getting stuck. A full cluster restart did not solve the problem; the accumulating tasks came back quickly. If we restart a node that has the problem, it just transfers to other nodes like a disease.
The detailed task list shows entries like this one:

         "node" : "Q_pZKTu4R9-wldTlrPsLcA",
          "id" : 29589541,
          "type" : "netty",
          "action" : "indices:data/read/get",
          "description" : "",
          "start_time_in_millis" : 1499508299639,
          "running_time_in_nanos" : 16858865861967,
          "cancellable" : false

The problem first surfaced after we upgraded the cluster to 5.4; we had never seen it before.

Does your application update a document before it tries to retrieve it with the Get API?

No, it's just a get after a search.

I asked because the realtime get changed recently: it now does a refresh before getting the document if the document has changed since the last refresh.

But since you are not updating documents, it is not clear to me why the task list is filling up with get actions.

Maybe @jasontedor knows more?

I think we need to see the output of the hot threads API and also the output of /_nodes/stats?filter_path=**.thread_pool.get.
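One way to use the thread-pool stats is to take two snapshots of /_nodes/stats?filter_path=**.thread_pool.get a minute or so apart and see whether the get queue is growing and requests are being rejected. A minimal sketch (the field names follow the nodes-stats response format; the node ID is reused from the task entry above and the numbers are invented for illustration):

```python
# Hypothetical snapshots of GET /_nodes/stats?filter_path=**.thread_pool.get
# taken a minute apart; the stats values are made up for illustration.
snapshot_1 = {
    "nodes": {
        "Q_pZKTu4R9-wldTlrPsLcA": {"thread_pool": {"get": {
            "threads": 16, "active": 16, "queue": 120,
            "rejected": 0, "completed": 1000000}}},
    }
}
snapshot_2 = {
    "nodes": {
        "Q_pZKTu4R9-wldTlrPsLcA": {"thread_pool": {"get": {
            "threads": 16, "active": 16, "queue": 950,
            "rejected": 40, "completed": 1000150}}},
    }
}

def get_pool_delta(before, after):
    """Per-node change in the get thread pool between two stats snapshots."""
    deltas = {}
    for node_id, stats in after["nodes"].items():
        g_after = stats["thread_pool"]["get"]
        g_before = before["nodes"][node_id]["thread_pool"]["get"]
        deltas[node_id] = {
            "queue_growth": g_after["queue"] - g_before["queue"],
            "new_rejections": g_after["rejected"] - g_before["rejected"],
            "completed": g_after["completed"] - g_before["completed"],
        }
    return deltas

for node_id, delta in get_pool_delta(snapshot_1, snapshot_2).items():
    print(node_id, delta)
```

A rapidly growing queue with few completions on a node whose get threads are all active would point at the same stuck-task symptom visible in the task list.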

The cluster was completely restarted ~7 hours ago, so there are not too many hung tasks yet (~17k at the moment), but the application hangs after a couple of minutes with ES read timeouts, and nothing shows up in Elasticsearch's logs. The application itself did not change.

Hot threads:


BTW, for search and other operations the task API shows which index is involved, but for get it doesn't. Is that intentional?
It might be relevant which index has the problem (if it can be narrowed down to one).

After a truly complete restart (master nodes included this time), the accumulation of tasks seems to have stopped. Still, I would be a lot calmer if I knew what caused this phenomenon.

I've opened an issue about this:

because we still don't know what happened or how it could be solved (apart from taking every possible load off Elasticsearch and doing several cluster restarts).
Please tell me if we can provide more info to help sort this out.
