Elasticsearch 8.17.3: Stuck "update_tsdb_data_stream_end_times" Pending Task Blocking ILM Policy

Environment

  • Elasticsearch 8.17.3 in a 3-node cluster
  • SLES 12 VMs
  • JDK 21.0.5+11 (JRE); the issue also occurred with the bundled JDK (Java 23) and with Java 17
  • 30GB partition for Elasticsearch data
  • 64GB RAM per VM with JVM settings: -Xms8g -Xmx8g
  • SSL communication between nodes on one subnet, client communication on another subnet

Configuration

  • ILM policy "ilm_policy" for our main index "my_index_syslog":
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "20m",
            "max_size": "10gb"
          }
        }
      },
      "delete": {
        "min_age": "0m",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
  • ILM poll interval: indices.lifecycle.poll_interval: 1m
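For reference, a minimal sketch of one way to set and verify this poll interval at runtime (assuming the cluster answers plain HTTP on localhost:9200 without authentication; adjust host, credentials, and TLS to your setup, and note the setting can equally be placed in elasticsearch.yml):

# set the ILM poll interval as a persistent, dynamic cluster setting
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"indices.lifecycle.poll_interval": "1m"}}'

# verify the effective value
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.indices.lifecycle.poll_interval'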

Usage Pattern

  • Two Logstash 8.17.3 clients continuously ingesting logs (up to 3000+ logs/second)
  • Clients query the index every 15 seconds
  • Approximately 50 additional client requests every 15 seconds

Issue

I've recently observed a persistent URGENT-priority pending task, "update_tsdb_data_stream_end_times", that never resolves. It blocks the ILM tasks, prevents rollovers, and eventually causes the storage partition to fill up. Important note: the issue appears anywhere between 5 hours and 1 week after installing Elasticsearch on the VMs.
Here's an example of the _cluster/pending_tasks output when the disk partition is not full:

{
  "tasks": [
    {
      "insert_order": 7465,
      "priority": "URGENT",
      "source": "update_tsdb_data_stream_end_times",
      "executing": false,
      "time_in_queue_millis": 23574471,
      "time_in_queue": "6.5h"
    },
    {
      "insert_order": 7463,
      "priority": "NORMAL",
      "source": "ilm-set-step-info {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2025.04.22-000001], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing": false,
      "time_in_queue_millis": 23633734,
      "time_in_queue": "6.5h"
    },
    {
      "insert_order": 7464,
      "priority": "NORMAL",
      "source": "ilm-set-step-info {policy [ilm_policy], index [my_index_syslog -000155], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing": false,
      "time_in_queue_millis": 23633734,
      "time_in_queue": "6.5h"
    },
    {
      "insert_order": 7466,
      "priority": "NORMAL",
      "source": "ilm-move-to-step {policy [ilm_policy], index [my_index_syslog -000155], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}], nextStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"attempt-rollover\"}]}",
      "executing": false,
      "time_in_queue_millis": 23153609,
      "time_in_queue": "6.4h"
    }
  ]
}

As you can see, our index "my_index_syslog-000155" is stuck at the "check-rollover-ready" step, even though it should have rolled over based on our ILM policy (max_size: 10gb or max_age: 20m).
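To double-check that the rollover conditions really were exceeded, the index size and creation date can be compared against the policy, for example (a sketch, assuming the cluster is reachable on localhost:9200):

# list primary store size and creation date for the syslog backing indices
curl -s 'http://localhost:9200/_cat/indices/my_index_syslog-*?v&h=index,pri.store.size,creation.date.string&s=index'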

Here is another example when the disk partition is full:

{
  "tasks" : [
    {
      "insert_order" : 8624,
      "priority" : "IMMEDIATE",
      "source" : "node-left",
      "executing" : false,
      "time_in_queue_millis" : 354266805,
      "time_in_queue" : "4.1d"
    },
    {
      "insert_order" : 8625,
      "priority" : "IMMEDIATE",
      "source" : "node-left",
      "executing" : false,
      "time_in_queue_millis" : 354266605,
      "time_in_queue" : "4.1d"
    },
    {
      "insert_order" : 8607,
      "priority" : "URGENT",
      "source" : "update_tsdb_data_stream_end_times",
      "executing" : false,
      "time_in_queue_millis" : 376919703,
      "time_in_queue" : "4.3d"
    },
    {
      "insert_order" : 8622,
      "priority" : "URGENT",
      "source" : "node-join",
      "executing" : false,
      "time_in_queue_millis" : 354270007,
      "time_in_queue" : "4.1d"
    },
    {
      "insert_order" : 8623,
      "priority" : "URGENT",
      "source" : "node-join",
      "executing" : false,
      "time_in_queue_millis" : 354269806,
      "time_in_queue" : "4.1d"
    },
    {
      "insert_order" : 8609,
      "priority" : "HIGH",
      "source" : "cluster_reroute(disk threshold monitor)",
      "executing" : false,
      "time_in_queue_millis" : 375166869,
      "time_in_queue" : "4.3d"
    },
    {
      "insert_order" : 8621,
      "priority" : "HIGH",
      "source" : "shard-failed",
      "executing" : false,
      "time_in_queue_millis" : 354461489,
      "time_in_queue" : "4.1d"
    },
    {
      "insert_order" : 8605,
      "priority" : "NORMAL",
      "source" : "ilm-set-step-info {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2025.04.29-000001], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 377099256,
      "time_in_queue" : "4.3d"
    },
    {
      "insert_order" : 8606,
      "priority" : "NORMAL",
      "source" : "ilm-set-step-info {policy [ilm_policy], index [my_index_syslog-000100], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 377099256,
      "time_in_queue" : "4.3d"
    },
    {
      "insert_order" : 8608,
      "priority" : "NORMAL",
      "source" : "ilm-move-to-step {policy [ilm_policy], index [my_index _syslog-000100], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}], nextStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"attempt-rollover\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 376199315,
      "time_in_queue" : "4.3d"
    }
  ]
}

The only workaround I've found so far is to restart all Elasticsearch nodes whenever the "update_tsdb_data_stream_end_times" pending task appears.
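For reference, that workaround amounts to a rolling restart of the service on each node (a sketch assuming a systemd-managed package install; adapt to your deployment):

# run on each node in turn, waiting for the cluster to go green again before moving on
sudo systemctl restart elasticsearch.service
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=120s'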

We've been unable to identify the root cause of this "update_tsdb_data_stream_end_times" pending task.

Has anyone encountered this issue or can suggest how to resolve it?

Thank you for your help!

Hello,

Could you please confirm:

  1. At the time of the issue, or just before it occurs, is the cluster state/health OK, with no other problems?
  2. Could you check the pending-task count per node? (see the sketch after this list)
    Task management API | Elasticsearch Guide [8.18] | Elastic
  3. Have you found anything unusual in the Elasticsearch logs?
  4. Next time, could you also check:
    GET my_index_syslog-000100/_ilm/explain
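For point 2, one way to count in-flight tasks per node via the task management API could look like this (a sketch, assuming jq is installed and the cluster answers on localhost:9200):

# group running tasks by node and count them
curl -s 'http://localhost:9200/_tasks?group_by=nodes' \
  | jq '.nodes | to_entries[] | {node: .value.name, tasks: (.value.tasks | length)}'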

Thanks!!

Hello,

Thanks for your answer.

  1. Yes, no issues were detected in the cluster; both before and while the issue occurs (when the ILM pending tasks are blocked by the update_tsdb_data_stream_end_times task), the cluster health status remains green.
  2. Here is an extract of the response from a GET request to the cluster's "_tasks" endpoint while the "update_tsdb_data_stream_end_times" pending task is present:
{
  "tasks" : {
    "nP9kF78cTJyyRtgtVbmaTA:28184136" : {
      "node" : "nP9kF78cTJyyRtgtVbmaTA",
      "id" : 28184136,
      "type" : "transport",
      "action" : "cluster:monitor/tasks/lists",
      "description" : "",
      "start_time_in_millis" : 1746058741383,
      "running_time_in_nanos" : 195985,
      "cancellable" : true,
      "cancelled" : false,
      "headers" : { },
      "children" : [
        {
          "node" : "nP9kF78cTJyyRtgtVbmaTA",
          "id" : 28184137,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists[n]",
          "description" : "",
          "start_time_in_millis" : 1746058741384,
          "running_time_in_nanos" : 59306,
          "cancellable" : true,
          "cancelled" : false,
          "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184136",
          "headers" : { }
        },
        {
          "node" : "mv8BvCSqSeqbYKPBBRIzUA",
          "id" : 31154761,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists[n]",
          "description" : "",
          "start_time_in_millis" : 1746058741384,
          "running_time_in_nanos" : 56617,
          "cancellable" : true,
          "cancelled" : false,
          "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184136",
          "headers" : { }
        },
        {
          "node" : "fQ5flyTAQjusRvNX_z2e1Q",
          "id" : 30060773,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists[n]",
          "description" : "",
          "start_time_in_millis" : 1746058741385,
          "running_time_in_nanos" : 33331,
          "cancellable" : true,
          "cancelled" : false,
          "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184136",
          "headers" : { }
        }
      ]
    },
    "nP9kF78cTJyyRtgtVbmaTA:28184131" : {
      "node" : "nP9kF78cTJyyRtgtVbmaTA",
      "id" : 28184131,
      "type" : "transport",
      "action" : "indices:data/write/bulk",
      "description" : "requests[125], indices[my_index_syslog]",
      "start_time_in_millis" : 1746058741381,
      "running_time_in_nanos" : 2692490,
      "cancellable" : false,
      "headers" : {
        "X-elastic-product-origin" : "logstash-output-elasticsearch"
      },
      "children" : [
        {
          "node" : "nP9kF78cTJyyRtgtVbmaTA",
          "id" : 28184134,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "waiting_on_primary"
          },
          "description" : "requests[125], index[my_index_syslog-000100][0]",
          "start_time_in_millis" : 1746058741383,
          "running_time_in_nanos" : 621558,
          "cancellable" : false,
          "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184131",
          "headers" : {
            "X-elastic-product-origin" : "logstash-output-elasticsearch"
          },
          "children" : [
            {
              "node" : "nP9kF78cTJyyRtgtVbmaTA",
              "id" : 28184135,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s][p]",
              "status" : {
                "phase" : "primary"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741383,
              "running_time_in_nanos" : 582414,
              "cancellable" : false,
              "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184134",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              }
            }
          ]
        }
      ]
    },
    "fQ5flyTAQjusRvNX_z2e1Q:46" : {
      "node" : "fQ5flyTAQjusRvNX_z2e1Q",
      "id" : 46,
      "type" : "persistent",
      "action" : "health-node[c]",
      "status" : {
        "state" : "STARTED"
      },
      "description" : "id=health-node",
      "start_time_in_millis" : 1745933117413,
      "running_time_in_nanos" : 125623971764880,
      "cancellable" : true,
      "cancelled" : false,
      "parent_task_id" : "cluster:8",
      "headers" : { }
    },
    "fQ5flyTAQjusRvNX_z2e1Q:47" : {
      "node" : "fQ5flyTAQjusRvNX_z2e1Q",
      "id" : 47,
      "type" : "persistent",
      "action" : "geoip-downloader[c]",
      "status" : {
        "successful_downloads" : 0,
        "failed_downloads" : 0,
        "total_download_time" : 0,
        "databases_count" : 0,
        "skipped_updates" : 0,
        "expired_databases" : 0
      },
      "description" : "id=geoip-downloader",
      "start_time_in_millis" : 1745933117414,
      "running_time_in_nanos" : 125623971070945,
      "cancellable" : true,
      "cancelled" : false,
      "parent_task_id" : "cluster:9",
      "headers" : { }
    },
    "mv8BvCSqSeqbYKPBBRIzUA:31154755" : {
      "node" : "mv8BvCSqSeqbYKPBBRIzUA",
      "id" : 31154755,
      "type" : "transport",
      "action" : "indices:data/write/bulk",
      "description" : "requests[125], indices[my_index_syslog]",
      "start_time_in_millis" : 1746058741372,
      "running_time_in_nanos" : 12130414,
      "cancellable" : false,
      "headers" : {
        "X-elastic-product-origin" : "logstash-output-elasticsearch"
      },
      "children" : [
        {
          "node" : "mv8BvCSqSeqbYKPBBRIzUA",
          "id" : 31154756,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[125], index[my_index_syslog-000100][0]",
          "start_time_in_millis" : 1746058741372,
          "running_time_in_nanos" : 12002698,
          "cancellable" : false,
          "parent_task_id" : "mv8BvCSqSeqbYKPBBRIzUA:31154755",
          "headers" : {
            "X-elastic-product-origin" : "logstash-output-elasticsearch"
          },
          "children" : [
            {
              "node" : "nP9kF78cTJyyRtgtVbmaTA",
              "id" : 28184127,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741373,
              "running_time_in_nanos" : 10952525,
              "cancellable" : false,
              "parent_task_id" : "mv8BvCSqSeqbYKPBBRIzUA:31154756",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              },
              "children" : [
                {
                  "node" : "nP9kF78cTJyyRtgtVbmaTA",
                  "id" : 28184128,
                  "type" : "transport",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[125], index[my_index_syslog-000100][0]",
                  "start_time_in_millis" : 1746058741373,
                  "running_time_in_nanos" : 10872276,
                  "cancellable" : false,
                  "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184127",
                  "headers" : {
                    "X-elastic-product-origin" : "logstash-output-elasticsearch"
                  }
                }
              ]
            }
          ]
        }
      ]
    },
    "mv8BvCSqSeqbYKPBBRIzUA:31154759" : {
      "node" : "mv8BvCSqSeqbYKPBBRIzUA",
      "id" : 31154759,
      "type" : "transport",
      "action" : "indices:data/write/bulk",
      "description" : "requests[125], indices[my_index_syslog]",
      "start_time_in_millis" : 1746058741373,
      "running_time_in_nanos" : 10824884,
      "cancellable" : false,
      "headers" : {
        "X-elastic-product-origin" : "logstash-output-elasticsearch"
      },
      "children" : [
        {
          "node" : "mv8BvCSqSeqbYKPBBRIzUA",
          "id" : 31154760,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[125], index[my_index_syslog-000100][0]",
          "start_time_in_millis" : 1746058741373,
          "running_time_in_nanos" : 10685634,
          "cancellable" : false,
          "parent_task_id" : "mv8BvCSqSeqbYKPBBRIzUA:31154759",
          "headers" : {
            "X-elastic-product-origin" : "logstash-output-elasticsearch"
          },
          "children" : [
            {
              "node" : "nP9kF78cTJyyRtgtVbmaTA",
              "id" : 28184129,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741374,
              "running_time_in_nanos" : 9547174,
              "cancellable" : false,
              "parent_task_id" : "mv8BvCSqSeqbYKPBBRIzUA:31154760",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              },
              "children" : [
                {
                  "node" : "nP9kF78cTJyyRtgtVbmaTA",
                  "id" : 28184130,
                  "type" : "transport",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[125], index[my_index_syslog-000100][0]",
                  "start_time_in_millis" : 1746058741374,
                  "running_time_in_nanos" : 9394368,
                  "cancellable" : false,
                  "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184129",
                  "headers" : {
                    "X-elastic-product-origin" : "logstash-output-elasticsearch"
                  }
                }
              ]
            }
          ]
        }
      ]
    },
    "fQ5flyTAQjusRvNX_z2e1Q:30060767" : {
      "node" : "fQ5flyTAQjusRvNX_z2e1Q",
      "id" : 30060767,
      "type" : "transport",
      "action" : "indices:data/write/bulk",
      "description" : "requests[125], indices[my_index_syslog]",
      "start_time_in_millis" : 1746058741380,
      "running_time_in_nanos" : 4870133,
      "cancellable" : false,
      "headers" : {
        "X-elastic-product-origin" : "logstash-output-elasticsearch"
      },
      "children" : [
        {
          "node" : "fQ5flyTAQjusRvNX_z2e1Q",
          "id" : 30060768,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[125], index[my_index_syslog-000100][0]",
          "start_time_in_millis" : 1746058741380,
          "running_time_in_nanos" : 4728549,
          "cancellable" : false,
          "parent_task_id" : "fQ5flyTAQjusRvNX_z2e1Q:30060767",
          "headers" : {
            "X-elastic-product-origin" : "logstash-output-elasticsearch"
          },
          "children" : [
            {
              "node" : "nP9kF78cTJyyRtgtVbmaTA",
              "id" : 28184132,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741383,
              "running_time_in_nanos" : 906145,
              "cancellable" : false,
              "parent_task_id" : "fQ5flyTAQjusRvNX_z2e1Q:30060768",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              },
              "children" : [
                {
                  "node" : "nP9kF78cTJyyRtgtVbmaTA",
                  "id" : 28184133,
                  "type" : "transport",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[125], index[my_index_syslog-000100][0]",
                  "start_time_in_millis" : 1746058741383,
                  "running_time_in_nanos" : 797269,
                  "cancellable" : false,
                  "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184132",
                  "headers" : {
                    "X-elastic-product-origin" : "logstash-output-elasticsearch"
                  }
                }
              ]
            }
          ]
        }
      ]
    },
    "nP9kF78cTJyyRtgtVbmaTA:28184124" : {
      "node" : "nP9kF78cTJyyRtgtVbmaTA",
      "id" : 28184124,
      "type" : "transport",
      "action" : "indices:data/write/bulk",
      "description" : "requests[125], indices[my_index_syslog]",
      "start_time_in_millis" : 1746058741370,
      "running_time_in_nanos" : 13491482,
      "cancellable" : false,
      "headers" : {
        "X-elastic-product-origin" : "logstash-output-elasticsearch"
      },
      "children" : [
        {
          "node" : "nP9kF78cTJyyRtgtVbmaTA",
          "id" : 28184125,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "waiting_on_primary"
          },
          "description" : "requests[125], index[my_index_syslog-000100][0]",
          "start_time_in_millis" : 1746058741370,
          "running_time_in_nanos" : 13291034,
          "cancellable" : false,
          "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184124",
          "headers" : {
            "X-elastic-product-origin" : "logstash-output-elasticsearch"
          },
          "children" : [
            {
              "node" : "fQ5flyTAQjusRvNX_z2e1Q",
              "id" : 30060771,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s][r]",
              "status" : {
                "phase" : "replica"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741384,
              "running_time_in_nanos" : 623658,
              "cancellable" : false,
              "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184125",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              }
            },
            {
              "node" : "nP9kF78cTJyyRtgtVbmaTA",
              "id" : 28184126,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s][p]",
              "status" : {
                "phase" : "primary"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741370,
              "running_time_in_nanos" : 13206032,
              "cancellable" : false,
              "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184125",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              }
            }
          ]
        }
      ]
    }
  }
}
  3. Nothing unusual has been detected in the Elasticsearch logs; at regular intervals there are log entries indicating that the ILM rollover is ready to proceed because the conditions have been reached.
  4. Here is an extract of the response from GET my_index_syslog-000100/_ilm/explain while the "update_tsdb_data_stream_end_times" pending task is present:
{
  "indices" : {
    "my_index_syslog-000100" : {
      "index" : "my_index_syslog-000100",
      "managed" : true,
      "policy" : "raw_syslog",
      "index_creation_date_millis" : 1746057376892,
      "time_since_index_creation" : "22.73m",
      "lifecycle_date_millis" : 1746057376892,
      "age" : "22.73m",
      "phase" : "hot",
      "phase_time_millis" : 1746057377030,
      "action" : "rollover",
      "action_time_millis" : 1746057377230,
      "step" : "check-rollover-ready",
      "step_time_millis" : 1746057377230,
      "phase_execution" : {
        "policy" : "raw_syslog",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_age" : "20m",
              "max_primary_shard_docs" : 200000000,
              "min_docs" : 1,
              "max_size" : "10gb"
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1745929943516
      }
    }
  }
}

So we can see here that the age is already higher than the max_age parameter. When the issue occurs, the age reported by _ilm/explain keeps growing indefinitely until either the disk saturates or I restart the Elasticsearch nodes.
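To illustrate, the growth can be observed by polling the explain API periodically, for example (a sketch; the 60-second interval and the localhost:9200 endpoint are assumptions):

# print the index age and current ILM step once per minute
while true; do
  curl -s 'http://localhost:9200/my_index_syslog-000100/_ilm/explain?human' \
    | jq '.indices[] | {index, age, step}'
  sleep 60
done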

Thank you for your help.

Update on the issue:

I've been exploring a potential workaround by upgrading Elasticsearch to version 8.17.7. Since implementing this change, the problem hasn't reoccurred for several days now. I'll continue to monitor and update this ticket if needed.

If anyone has insights into the root cause of this issue, I would greatly appreciate your input.

Hi @bpaoli - I suspect you hit a known issue in Elasticsearch where the master node stops processing tasks. The related issue is Scaling EsExecutors with core size 0 might starve work due to missing workers · Issue #124667 · elastic/elasticsearch · GitHub, and it has been fixed in 8.17.4 (cf. the pull request Prevent starvation bug if using scaling EsThreadPoolExecutor with core pool size = 0 by mosche · Pull Request #124732 · elastic/elasticsearch · GitHub).

One way to identify the problem is to retrieve the list of pending tasks (cf. GET _cluster/pending_tasks?human) and run a command similar to the following:

cat cluster_pending_tasks.json | jq '.tasks[].executing' | sort | uniq -c                                                                                                 
NNN false

If the command returns only a single line of the form NNN false (where NNN is a number), it means there are tasks queued but none is executing, which is (almost certainly) an indication of this problem.
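If it helps, the whole check can be run as one pipeline (assuming the cluster is reachable on localhost:9200 without authentication; adjust host and credentials as needed):

# fetch the pending tasks and count how many are executing vs. queued
curl -s 'http://localhost:9200/_cluster/pending_tasks?human' > cluster_pending_tasks.json
jq '.tasks[].executing' cluster_pending_tasks.json | sort | uniq -c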