Elasticsearch has accumulated a lot of pending tasks and stopped indexing

Symptom: when a new index is created at the daily rollover, Elasticsearch stops indexing. Cluster health shows green, but there are many pending tasks.

Using GET _tasks, I saw thousands of transport tasks and hundreds of direct tasks.
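
A rough count by task type can be produced with something like the following sketch (the localhost:9200 endpoint and jq are illustrative assumptions, not necessarily what was actually run):

  # Count currently running tasks by type (e.g. transport vs direct).
  # Assumes the cluster is reachable on localhost:9200 and jq is installed.
  curl -s 'http://localhost:9200/_tasks' \
    | jq '[.nodes[].tasks[].type] | group_by(.) | map({type: .[0], count: length})'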

The cluster recovered after I killed the master node. It seems the master node was stuck; everything returned to normal once the master changed.

Is there any suggestion about this?


Sounds like a bug. What version are you running?

version: 7.17.2

Can you share the output of GET _tasks?detailed and GET _cluster/pending_tasks from the time of the problem?
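
Something along these lines will capture both while the problem is happening (localhost:9200 is an assumption; adjust the host/port and add credentials if security is enabled):

  # Capture both outputs to timestamped files.
  ts=$(date +%Y%m%d-%H%M%S)
  curl -s 'http://localhost:9200/_tasks?detailed&pretty'        > tasks-detailed-$ts.json
  curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty' > pending-tasks-$ts.json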

I'm not sure whether providing only the first few tasks is enough; from what I can see, there are many tasks that all look like the ones below. As for the pending tasks, I forgot to note them down.
Hope this helps.

{
  "nodes" : {
    "XXX" : {
      "name" : "es-cluster-tier1-2",
      "transport_address" : "172.24.155.178:9300",
      "host" : "172.24.155.178",
      "ip" : "172.24.155.178:9300",
      "roles" : [
        "data_content",
        "data_hot",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes" : {
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "512",
        "xpack.installed" : "true",
        "ml.max_jvm_size" : "4294967296",
        "transform.node" : "true"
      },
      "tasks" : {
        "XXX:110229137" : {
          "node" : "XXX",
          "id" : 110229137,
          "type" : "transport",
          "action" : "indices:data/write/bulk",
          "start_time_in_millis" : 1692748957757,
          "running_time_in_nanos" : 1391395119423,
          "cancellable" : false,
          "headers" : { }
        },
        "XXX:110229136" : {
          "node" : "XXX",
          "id" : 110229136,
          "type" : "transport",
          "action" : "indices:data/write/bulk",
          "start_time_in_millis" : 1692748957704,
          "running_time_in_nanos" : 1391447831580,
          "cancellable" : false,
          "headers" : { }
        },
        "XXX:110220947" : {
          "node" : "XXX",
          "id" : 110220947,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "start_time_in_millis" : 1692748860676,
          "running_time_in_nanos" : 1488476466847,
          "cancellable" : false,
          "parent_task_id" : "XXX:110220946",
          "headers" : { }
        },
        ...

Not really, no, we'd need to see the full output of both APIs. Without the pending tasks I don't think we can answer your questions. Please capture that if it happens again.
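
Since it is intermittent, a small watcher loop can grab the diagnostics automatically the next time the queue builds up. This is only a sketch; the endpoint, the 100-task threshold, and the output paths are assumptions to adapt:

  # Poll every 60s; dump diagnostics once pending tasks start piling up.
  # Assumes localhost:9200 and jq; add credentials if security is enabled.
  while true; do
    count=$(curl -s 'http://localhost:9200/_cluster/pending_tasks' | jq '.tasks | length')
    if [ "$count" -gt 100 ]; then
      ts=$(date +%Y%m%d-%H%M%S)
      curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty' > pending-$ts.json
      curl -s 'http://localhost:9200/_tasks?detailed&pretty'        > tasks-$ts.json
    fi
    sleep 60
  done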

OK, I will keep watching, since it is unpredictable when this will happen again.

It happened again.
The pending tasks are as below:

  insertOrder timeInQueue priority source
        42855       55.2m NORMAL   cluster_reroute(reroute after starting shards)
        42880       55.1m NORMAL   ilm-execute-cluster-state-steps [{"phase":"new","action":"init","name":"init"} => {"phase":"new","action":"complete","name":"complete"}]
        42909       55.1m NORMAL   ilm-execute-cluster-state-steps [{"phase":"new","action":"init","name":"init"} => {"phase":"new","action":"complete","name":"complete"}]
        42940       55.1m NORMAL   ilm-execute-cluster-state-steps [{"phase":"new","action":"init","name":"init"} => {"phase":"new","action":"complete","name":"complete"}]

The tasks that follow are all similar to the ilm-execute-cluster-state-steps tasks above.

So does it seem to be caused by cluster_reroute?

GET tasks

  action                  task_id                         parent_task_id type      start_time  timestamp running_time
  indices:data/write/bulk _Yn1U8jOSvWhgfaEzIVlIg:79310960 -              transport 1.69388E+12 02:31:12  50.4m
  indices:data/write/bulk jKFbZ3SNSBWFI6JWUv5L7g:85644667 -              transport 1.69388E+12 02:31:14  50.3m
  indices:data/write/bulk _Yn1U8jOSvWhgfaEzIVlIg:79311961 -              transport 1.69388E+12 02:31:16  50.3m
  indices:data/write/bulk _Yn1U8jOSvWhgfaEzIVlIg:79312451 -              transport 1.69388E+12 02:31:18  50.3m

Task detail:

  "_Yn1U8jOSvWhgfaEzIVlIg:79310960" : {
    "node" : "_Yn1U8jOSvWhgfaEzIVlIg",
    "id" : 79310960,
    "type" : "transport",
    "action" : "indices:data/write/bulk",
    "start_time_in_millis" : 1693881072568,
    "running_time_in_nanos" : 3524128366857,
    "cancellable" : false,
    "headers" : { }
  },
  "jKFbZ3SNSBWFI6JWUv5L7g:85644667" : {
    "node" : "jKFbZ3SNSBWFI6JWUv5L7g",
    "id" : 85644667,
    "type" : "transport",
    "action" : "indices:data/write/bulk",
    "start_time_in_millis" : 1693881074669,
    "running_time_in_nanos" : 3522028629897,
    "cancellable" : false,
    "headers" : { }
  },

Can you share the full output of GET _tasks?detailed and GET _cluster/pending_tasks from the time of the problem?

I sorted the pending tasks by insertOrder, so I think the tasks that follow are not related. (The rest of the tasks are all the same as "ilm-execute-cluster-state-steps [{"phase":"new","action":"init","name":"init"} => {"phase":"new","action":"complete","name":"complete"}]", differing only in insertOrder/timeInQueue.)

And I added the _tasks?detailed output in my previous reply.

Sorry, you need to share the complete output of the APIs I indicated.

You're still not using GET _cluster/pending_tasks - the output you have shared is from GET _cat/pending_tasks. The difference is important.
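
For example (localhost:9200 assumed):

  # Condensed, human-readable table:
  curl -s 'http://localhost:9200/_cat/pending_tasks?v'
  # Full JSON output, which is what is needed here:
  curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'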

I think it would also help to see the output from GET _nodes/_master/hot_threads?threads=9999 while the cluster is in the problematic state.
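
For instance (again with localhost:9200 assumed), captured while the cluster is stuck:

  curl -s 'http://localhost:9200/_nodes/_master/hot_threads?threads=9999' > master-hot-threads-$(date +%Y%m%d-%H%M%S).txt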

I was wrong, sorry. It seems like I'll have to collect it again next time.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.