Elasticsearch has accumulated a lot of pending tasks and stopped indexing

Symptom: when a new index is created at the daily rollover, Elasticsearch stops indexing. Cluster health shows green, but there are many pending tasks.

Using GET _tasks, I saw thousands of transport tasks and hundreds of direct tasks.
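
A rough count by task type can be produced with something like the following sketch (the localhost:9200 endpoint and jq are illustrative assumptions, not necessarily what was actually run):

  # Count currently running tasks by type (e.g. transport vs direct).
  # Assumes the cluster is reachable on localhost:9200 and jq is installed.
  curl -s 'http://localhost:9200/_tasks' \
    | jq '[.nodes[].tasks[].type] | group_by(.) | map({type: .[0], count: length})'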

The cluster recovered after I killed the master node. It seems the master node was stuck; everything returned to normal once the master changed.

Is there any suggestion about this?


Sounds like a bug. What version are you running?

version: 7.17.2

Can you share the output of GET _tasks?detailed and GET _cluster/pending_tasks from the time of the problem?
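
Something along these lines will capture both while the problem is happening (localhost:9200 is an assumption; adjust the host/port and add credentials if security is enabled):

  # Capture both outputs to timestamped files.
  ts=$(date +%Y%m%d-%H%M%S)
  curl -s 'http://localhost:9200/_tasks?detailed&pretty'        > tasks-detailed-$ts.json
  curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty' > pending-tasks-$ts.json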

I'm not sure whether providing only the first few tasks is enough; from what I can see, there are many tasks that all look like the ones below. As for the pending tasks, I forgot to note them down.
Hope this helps.

{
  "nodes" : {
    "XXX" : {
      "name" : "es-cluster-tier1-2",
      "transport_address" : "172.24.155.178:9300",
      "host" : "172.24.155.178",
      "ip" : "172.24.155.178:9300",
      "roles" : [
        "data_content",
        "data_hot",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes" : {
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "512",
        "xpack.installed" : "true",
        "ml.max_jvm_size" : "4294967296",
        "transform.node" : "true"
      },
      "tasks" : {
        "XXX:110229137" : {
          "node" : "XXX",
          "id" : 110229137,
          "type" : "transport",
          "action" : "indices:data/write/bulk",
          "start_time_in_millis" : 1692748957757,
          "running_time_in_nanos" : 1391395119423,
          "cancellable" : false,
          "headers" : { }
        },
        "XXX:110229136" : {
          "node" : "XXX",
          "id" : 110229136,
          "type" : "transport",
          "action" : "indices:data/write/bulk",
          "start_time_in_millis" : 1692748957704,
          "running_time_in_nanos" : 1391447831580,
          "cancellable" : false,
          "headers" : { }
        },
        "XXX:110220947" : {
          "node" : "XXX",
          "id" : 110220947,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "start_time_in_millis" : 1692748860676,
          "running_time_in_nanos" : 1488476466847,
          "cancellable" : false,
          "parent_task_id" : "XXX:110220946",
          "headers" : { }
        },
        ...

Not really, no, we'd need to see the full output of both APIs. Without the pending tasks I don't think we can answer your questions. Please capture that if it happens again.
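
Since it is intermittent, a small watcher loop can grab the diagnostics automatically the next time the queue builds up. This is only a sketch; the endpoint, the 100-task threshold, and the output paths are assumptions to adapt:

  # Poll every 60s; dump diagnostics once pending tasks start piling up.
  # Assumes localhost:9200 and jq; add credentials if security is enabled.
  while true; do
    count=$(curl -s 'http://localhost:9200/_cluster/pending_tasks' | jq '.tasks | length')
    if [ "$count" -gt 100 ]; then
      ts=$(date +%Y%m%d-%H%M%S)
      curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty' > pending-$ts.json
      curl -s 'http://localhost:9200/_tasks?detailed&pretty'        > tasks-$ts.json
    fi
    sleep 60
  done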

OK, I will keep watching, since it is unpredictable when this will happen again.

It happened again.
The pending tasks are as below:

  insertOrder timeInQueue priority source
        42855       55.2m NORMAL   cluster_reroute(reroute after starting shards)
        42880       55.1m NORMAL   ilm-execute-cluster-state-steps [{"phase":"new","action":"init","name":"init"} => {"phase":"new","action":"complete","name":"complete"}]
        42909       55.1m NORMAL   ilm-execute-cluster-state-steps [{"phase":"new","action":"init","name":"init"} => {"phase":"new","action":"complete","name":"complete"}]
        42940       55.1m NORMAL   ilm-execute-cluster-state-steps [{"phase":"new","action":"init","name":"init"} => {"phase":"new","action":"complete","name":"complete"}]

The tasks that follow are all similar to the ilm-execute-cluster-state-steps tasks above.

So does it seem to be caused by cluster_reroute?

GET tasks

  action                  task_id                         parent_task_id type      start_time  timestamp running_time
  indices:data/write/bulk _Yn1U8jOSvWhgfaEzIVlIg:79310960 -              transport 1.69388E+12 02:31:12  50.4m
  indices:data/write/bulk jKFbZ3SNSBWFI6JWUv5L7g:85644667 -              transport 1.69388E+12 02:31:14  50.3m
  indices:data/write/bulk _Yn1U8jOSvWhgfaEzIVlIg:79311961 -              transport 1.69388E+12 02:31:16  50.3m
  indices:data/write/bulk _Yn1U8jOSvWhgfaEzIVlIg:79312451 -              transport 1.69388E+12 02:31:18  50.3m

Task detail:

  "_Yn1U8jOSvWhgfaEzIVlIg:79310960" : {
    "node" : "_Yn1U8jOSvWhgfaEzIVlIg",
    "id" : 79310960,
    "type" : "transport",
    "action" : "indices:data/write/bulk",
    "start_time_in_millis" : 1693881072568,
    "running_time_in_nanos" : 3524128366857,
    "cancellable" : false,
    "headers" : { }
  },
  "jKFbZ3SNSBWFI6JWUv5L7g:85644667" : {
    "node" : "jKFbZ3SNSBWFI6JWUv5L7g",
    "id" : 85644667,
    "type" : "transport",
    "action" : "indices:data/write/bulk",
    "start_time_in_millis" : 1693881074669,
    "running_time_in_nanos" : 3522028629897,
    "cancellable" : false,
    "headers" : { }
  },

Can you share the full output of GET _tasks?detailed and GET _cluster/pending_tasks from the time of the problem?

I sorted the pending tasks by insertOrder, so I think the tasks that follow are not related. (The rest of the tasks are all the same as "ilm-execute-cluster-state-steps [{"phase":"new","action":"init","name":"init"} => {"phase":"new","action":"complete","name":"complete"}]", differing only in insertOrder/timeInQueue.)

And I added the _tasks?detailed output in my previous reply.

Sorry, you need to share the complete output of the APIs I indicated.

You're still not using GET _cluster/pending_tasks - the output you have shared is from GET _cat/pending_tasks. The difference is important.
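
For example (localhost:9200 assumed):

  # Condensed, human-readable table:
  curl -s 'http://localhost:9200/_cat/pending_tasks?v'
  # Full JSON output, which is what is needed here:
  curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'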

I think it would also help to see the output from GET _nodes/_master/hot_threads?threads=9999 while the cluster is in the problematic state.
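
For instance (again with localhost:9200 assumed), captured while the cluster is stuck:

  curl -s 'http://localhost:9200/_nodes/_master/hot_threads?threads=9999' > master-hot-threads-$(date +%Y%m%d-%H%M%S).txt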

I was wrong, sorry. It seems like I'll have to collect it again next time.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.