Elasticsearch 8.17.3: Stuck "update_tsdb_data_stream_end_times" Pending Task Blocking ILM Policy

Environment

  • Elasticsearch 8.17.3 in a 3-node cluster
  • SLES 12 VMs
  • JDK 21.0.5+11 (JRE); the issue also occurred with the bundled JDK (Java 23) and with Java 17
  • 30GB partition for Elasticsearch data
  • 64GB RAM per VM with JVM settings: -Xms8g -Xmx8g
  • SSL communication between nodes on one subnet, client communication on another subnet

Configuration

  • ILM policy "ilm_policy" for our main index "my_index_syslog":
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "20m",
            "max_size": "10gb"
          }
        }
      },
      "delete": {
        "min_age": "0m",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
  • ILM poll interval: indices.lifecycle.poll_interval: 1m
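For reference, a minimal sketch of one way to set and verify this poll interval at runtime (assuming the cluster answers plain HTTP on localhost:9200 without authentication; adjust host, credentials, and TLS to your setup, and note the setting can equally be placed in elasticsearch.yml):

# set the ILM poll interval as a persistent, dynamic cluster setting
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"indices.lifecycle.poll_interval": "1m"}}'

# verify the effective value
curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&filter_path=*.indices.lifecycle.poll_interval'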

Usage Pattern

  • Two Logstash 8.17.3 clients continuously ingesting logs (up to 3000+ logs/second)
  • Clients query the index every 15 seconds
  • Approximately 50 additional client requests every 15 seconds

Issue

I've recently observed a persistent URGENT-priority pending task, "update_tsdb_data_stream_end_times", that never resolves. It blocks the ILM tasks, prevents rollovers, and eventually causes the storage partition to fill up. Important note: the issue appears anywhere between 5 hours and 1 week after installing Elasticsearch on the VMs.
Here's an example of the _cluster/pending_tasks output when the disk partition is not full:

{
  "tasks": [
    {
      "insert_order": 7465,
      "priority": "URGENT",
      "source": "update_tsdb_data_stream_end_times",
      "executing": false,
      "time_in_queue_millis": 23574471,
      "time_in_queue": "6.5h"
    },
    {
      "insert_order": 7463,
      "priority": "NORMAL",
      "source": "ilm-set-step-info {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2025.04.22-000001], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing": false,
      "time_in_queue_millis": 23633734,
      "time_in_queue": "6.5h"
    },
    {
      "insert_order": 7464,
      "priority": "NORMAL",
      "source": "ilm-set-step-info {policy [ilm_policy], index [my_index_syslog -000155], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing": false,
      "time_in_queue_millis": 23633734,
      "time_in_queue": "6.5h"
    },
    {
      "insert_order": 7466,
      "priority": "NORMAL",
      "source": "ilm-move-to-step {policy [ilm_policy], index [my_index_syslog -000155], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}], nextStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"attempt-rollover\"}]}",
      "executing": false,
      "time_in_queue_millis": 23153609,
      "time_in_queue": "6.4h"
    }
  ]
}

As you can see, our index "my_index_syslog-000155" is stuck at the "check-rollover-ready" step, even though it should have rolled over based on our ILM policy (max_size: 10gb or max_age: 20m).
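To double-check that the rollover conditions really were exceeded, the index size and creation date can be compared against the policy, for example (a sketch, assuming the cluster is reachable on localhost:9200):

# list primary store size and creation date for the syslog backing indices
curl -s 'http://localhost:9200/_cat/indices/my_index_syslog-*?v&h=index,pri.store.size,creation.date.string&s=index'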

Here is another example when the disk partition is full:

{
  "tasks" : [
    {
      "insert_order" : 8624,
      "priority" : "IMMEDIATE",
      "source" : "node-left",
      "executing" : false,
      "time_in_queue_millis" : 354266805,
      "time_in_queue" : "4.1d"
    },
    {
      "insert_order" : 8625,
      "priority" : "IMMEDIATE",
      "source" : "node-left",
      "executing" : false,
      "time_in_queue_millis" : 354266605,
      "time_in_queue" : "4.1d"
    },
    {
      "insert_order" : 8607,
      "priority" : "URGENT",
      "source" : "update_tsdb_data_stream_end_times",
      "executing" : false,
      "time_in_queue_millis" : 376919703,
      "time_in_queue" : "4.3d"
    },
    {
      "insert_order" : 8622,
      "priority" : "URGENT",
      "source" : "node-join",
      "executing" : false,
      "time_in_queue_millis" : 354270007,
      "time_in_queue" : "4.1d"
    },
    {
      "insert_order" : 8623,
      "priority" : "URGENT",
      "source" : "node-join",
      "executing" : false,
      "time_in_queue_millis" : 354269806,
      "time_in_queue" : "4.1d"
    },
    {
      "insert_order" : 8609,
      "priority" : "HIGH",
      "source" : "cluster_reroute(disk threshold monitor)",
      "executing" : false,
      "time_in_queue_millis" : 375166869,
      "time_in_queue" : "4.3d"
    },
    {
      "insert_order" : 8621,
      "priority" : "HIGH",
      "source" : "shard-failed",
      "executing" : false,
      "time_in_queue_millis" : 354461489,
      "time_in_queue" : "4.1d"
    },
    {
      "insert_order" : 8605,
      "priority" : "NORMAL",
      "source" : "ilm-set-step-info {policy [.deprecation-indexing-ilm-policy], index [.ds-.logs-deprecation.elasticsearch-default-2025.04.29-000001], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 377099256,
      "time_in_queue" : "4.3d"
    },
    {
      "insert_order" : 8606,
      "priority" : "NORMAL",
      "source" : "ilm-set-step-info {policy [ilm_policy], index [my_index_syslog-000100], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 377099256,
      "time_in_queue" : "4.3d"
    },
    {
      "insert_order" : 8608,
      "priority" : "NORMAL",
      "source" : "ilm-move-to-step {policy [ilm_policy], index [my_index _syslog-000100], currentStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"check-rollover-ready\"}], nextStep [{\"phase\":\"hot\",\"action\":\"rollover\",\"name\":\"attempt-rollover\"}]}",
      "executing" : false,
      "time_in_queue_millis" : 376199315,
      "time_in_queue" : "4.3d"
    }
  ]
}

The only workaround I've found so far is to restart all Elasticsearch nodes whenever the "update_tsdb_data_stream_end_times" pending task appears.
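For reference, that workaround amounts to a rolling restart of the service on each node (a sketch assuming a systemd-managed package install; adapt to your deployment):

# run on each node in turn, waiting for the cluster to go green again before moving on
sudo systemctl restart elasticsearch.service
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=120s'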

We've been unable to identify the root cause of this "update_tsdb_data_stream_end_times" pending task.

Has anyone encountered this issue or can suggest how to resolve it?

Thank you for your help!

Hello,

Could you please confirm:

  1. At the time of the issue, or just before it occurs, is the cluster state/health OK, with no other problems?
  2. Could you check the pending-task count per node? (see the sketch after this list)
    Task management API | Elasticsearch Guide [8.18] | Elastic
  3. Have you found anything unusual in the Elasticsearch logs?
  4. Next time, could you also check:
    GET my_index_syslog-000100/_ilm/explain
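For point 2, one way to count in-flight tasks per node via the task management API could look like this (a sketch, assuming jq is installed and the cluster answers on localhost:9200):

# group running tasks by node and count them
curl -s 'http://localhost:9200/_tasks?group_by=nodes' \
  | jq '.nodes | to_entries[] | {node: .value.name, tasks: (.value.tasks | length)}'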

Thanks!!

Hello,

Thanks for your answer.

  1. Yes, no issues were detected in the cluster; both before and while the issue occurs (when the ILM pending tasks are blocked by the update_tsdb_data_stream_end_times task), the cluster health status remains green.
  2. Here is an extract of the response from a GET request to the cluster's "_tasks" endpoint while the "update_tsdb_data_stream_end_times" pending task is present:
{
  "tasks" : {
    "nP9kF78cTJyyRtgtVbmaTA:28184136" : {
      "node" : "nP9kF78cTJyyRtgtVbmaTA",
      "id" : 28184136,
      "type" : "transport",
      "action" : "cluster:monitor/tasks/lists",
      "description" : "",
      "start_time_in_millis" : 1746058741383,
      "running_time_in_nanos" : 195985,
      "cancellable" : true,
      "cancelled" : false,
      "headers" : { },
      "children" : [
        {
          "node" : "nP9kF78cTJyyRtgtVbmaTA",
          "id" : 28184137,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists[n]",
          "description" : "",
          "start_time_in_millis" : 1746058741384,
          "running_time_in_nanos" : 59306,
          "cancellable" : true,
          "cancelled" : false,
          "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184136",
          "headers" : { }
        },
        {
          "node" : "mv8BvCSqSeqbYKPBBRIzUA",
          "id" : 31154761,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists[n]",
          "description" : "",
          "start_time_in_millis" : 1746058741384,
          "running_time_in_nanos" : 56617,
          "cancellable" : true,
          "cancelled" : false,
          "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184136",
          "headers" : { }
        },
        {
          "node" : "fQ5flyTAQjusRvNX_z2e1Q",
          "id" : 30060773,
          "type" : "transport",
          "action" : "cluster:monitor/tasks/lists[n]",
          "description" : "",
          "start_time_in_millis" : 1746058741385,
          "running_time_in_nanos" : 33331,
          "cancellable" : true,
          "cancelled" : false,
          "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184136",
          "headers" : { }
        }
      ]
    },
    "nP9kF78cTJyyRtgtVbmaTA:28184131" : {
      "node" : "nP9kF78cTJyyRtgtVbmaTA",
      "id" : 28184131,
      "type" : "transport",
      "action" : "indices:data/write/bulk",
      "description" : "requests[125], indices[my_index_syslog]",
      "start_time_in_millis" : 1746058741381,
      "running_time_in_nanos" : 2692490,
      "cancellable" : false,
      "headers" : {
        "X-elastic-product-origin" : "logstash-output-elasticsearch"
      },
      "children" : [
        {
          "node" : "nP9kF78cTJyyRtgtVbmaTA",
          "id" : 28184134,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "waiting_on_primary"
          },
          "description" : "requests[125], index[my_index_syslog-000100][0]",
          "start_time_in_millis" : 1746058741383,
          "running_time_in_nanos" : 621558,
          "cancellable" : false,
          "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184131",
          "headers" : {
            "X-elastic-product-origin" : "logstash-output-elasticsearch"
          },
          "children" : [
            {
              "node" : "nP9kF78cTJyyRtgtVbmaTA",
              "id" : 28184135,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s][p]",
              "status" : {
                "phase" : "primary"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741383,
              "running_time_in_nanos" : 582414,
              "cancellable" : false,
              "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184134",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              }
            }
          ]
        }
      ]
    },
    "fQ5flyTAQjusRvNX_z2e1Q:46" : {
      "node" : "fQ5flyTAQjusRvNX_z2e1Q",
      "id" : 46,
      "type" : "persistent",
      "action" : "health-node[c]",
      "status" : {
        "state" : "STARTED"
      },
      "description" : "id=health-node",
      "start_time_in_millis" : 1745933117413,
      "running_time_in_nanos" : 125623971764880,
      "cancellable" : true,
      "cancelled" : false,
      "parent_task_id" : "cluster:8",
      "headers" : { }
    },
    "fQ5flyTAQjusRvNX_z2e1Q:47" : {
      "node" : "fQ5flyTAQjusRvNX_z2e1Q",
      "id" : 47,
      "type" : "persistent",
      "action" : "geoip-downloader[c]",
      "status" : {
        "successful_downloads" : 0,
        "failed_downloads" : 0,
        "total_download_time" : 0,
        "databases_count" : 0,
        "skipped_updates" : 0,
        "expired_databases" : 0
      },
      "description" : "id=geoip-downloader",
      "start_time_in_millis" : 1745933117414,
      "running_time_in_nanos" : 125623971070945,
      "cancellable" : true,
      "cancelled" : false,
      "parent_task_id" : "cluster:9",
      "headers" : { }
    },
    "mv8BvCSqSeqbYKPBBRIzUA:31154755" : {
      "node" : "mv8BvCSqSeqbYKPBBRIzUA",
      "id" : 31154755,
      "type" : "transport",
      "action" : "indices:data/write/bulk",
      "description" : "requests[125], indices[my_index_syslog]",
      "start_time_in_millis" : 1746058741372,
      "running_time_in_nanos" : 12130414,
      "cancellable" : false,
      "headers" : {
        "X-elastic-product-origin" : "logstash-output-elasticsearch"
      },
      "children" : [
        {
          "node" : "mv8BvCSqSeqbYKPBBRIzUA",
          "id" : 31154756,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[125], index[my_index_syslog-000100][0]",
          "start_time_in_millis" : 1746058741372,
          "running_time_in_nanos" : 12002698,
          "cancellable" : false,
          "parent_task_id" : "mv8BvCSqSeqbYKPBBRIzUA:31154755",
          "headers" : {
            "X-elastic-product-origin" : "logstash-output-elasticsearch"
          },
          "children" : [
            {
              "node" : "nP9kF78cTJyyRtgtVbmaTA",
              "id" : 28184127,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741373,
              "running_time_in_nanos" : 10952525,
              "cancellable" : false,
              "parent_task_id" : "mv8BvCSqSeqbYKPBBRIzUA:31154756",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              },
              "children" : [
                {
                  "node" : "nP9kF78cTJyyRtgtVbmaTA",
                  "id" : 28184128,
                  "type" : "transport",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[125], index[my_index_syslog-000100][0]",
                  "start_time_in_millis" : 1746058741373,
                  "running_time_in_nanos" : 10872276,
                  "cancellable" : false,
                  "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184127",
                  "headers" : {
                    "X-elastic-product-origin" : "logstash-output-elasticsearch"
                  }
                }
              ]
            }
          ]
        }
      ]
    },
    "mv8BvCSqSeqbYKPBBRIzUA:31154759" : {
      "node" : "mv8BvCSqSeqbYKPBBRIzUA",
      "id" : 31154759,
      "type" : "transport",
      "action" : "indices:data/write/bulk",
      "description" : "requests[125], indices[my_index_syslog]",
      "start_time_in_millis" : 1746058741373,
      "running_time_in_nanos" : 10824884,
      "cancellable" : false,
      "headers" : {
        "X-elastic-product-origin" : "logstash-output-elasticsearch"
      },
      "children" : [
        {
          "node" : "mv8BvCSqSeqbYKPBBRIzUA",
          "id" : 31154760,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[125], index[my_index_syslog-000100][0]",
          "start_time_in_millis" : 1746058741373,
          "running_time_in_nanos" : 10685634,
          "cancellable" : false,
          "parent_task_id" : "mv8BvCSqSeqbYKPBBRIzUA:31154759",
          "headers" : {
            "X-elastic-product-origin" : "logstash-output-elasticsearch"
          },
          "children" : [
            {
              "node" : "nP9kF78cTJyyRtgtVbmaTA",
              "id" : 28184129,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741374,
              "running_time_in_nanos" : 9547174,
              "cancellable" : false,
              "parent_task_id" : "mv8BvCSqSeqbYKPBBRIzUA:31154760",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              },
              "children" : [
                {
                  "node" : "nP9kF78cTJyyRtgtVbmaTA",
                  "id" : 28184130,
                  "type" : "transport",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[125], index[my_index_syslog-000100][0]",
                  "start_time_in_millis" : 1746058741374,
                  "running_time_in_nanos" : 9394368,
                  "cancellable" : false,
                  "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184129",
                  "headers" : {
                    "X-elastic-product-origin" : "logstash-output-elasticsearch"
                  }
                }
              ]
            }
          ]
        }
      ]
    },
    "fQ5flyTAQjusRvNX_z2e1Q:30060767" : {
      "node" : "fQ5flyTAQjusRvNX_z2e1Q",
      "id" : 30060767,
      "type" : "transport",
      "action" : "indices:data/write/bulk",
      "description" : "requests[125], indices[my_index_syslog]",
      "start_time_in_millis" : 1746058741380,
      "running_time_in_nanos" : 4870133,
      "cancellable" : false,
      "headers" : {
        "X-elastic-product-origin" : "logstash-output-elasticsearch"
      },
      "children" : [
        {
          "node" : "fQ5flyTAQjusRvNX_z2e1Q",
          "id" : 30060768,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "rerouted"
          },
          "description" : "requests[125], index[my_index_syslog-000100][0]",
          "start_time_in_millis" : 1746058741380,
          "running_time_in_nanos" : 4728549,
          "cancellable" : false,
          "parent_task_id" : "fQ5flyTAQjusRvNX_z2e1Q:30060767",
          "headers" : {
            "X-elastic-product-origin" : "logstash-output-elasticsearch"
          },
          "children" : [
            {
              "node" : "nP9kF78cTJyyRtgtVbmaTA",
              "id" : 28184132,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s]",
              "status" : {
                "phase" : "waiting_on_primary"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741383,
              "running_time_in_nanos" : 906145,
              "cancellable" : false,
              "parent_task_id" : "fQ5flyTAQjusRvNX_z2e1Q:30060768",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              },
              "children" : [
                {
                  "node" : "nP9kF78cTJyyRtgtVbmaTA",
                  "id" : 28184133,
                  "type" : "transport",
                  "action" : "indices:data/write/bulk[s][p]",
                  "status" : {
                    "phase" : "primary"
                  },
                  "description" : "requests[125], index[my_index_syslog-000100][0]",
                  "start_time_in_millis" : 1746058741383,
                  "running_time_in_nanos" : 797269,
                  "cancellable" : false,
                  "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184132",
                  "headers" : {
                    "X-elastic-product-origin" : "logstash-output-elasticsearch"
                  }
                }
              ]
            }
          ]
        }
      ]
    },
    "nP9kF78cTJyyRtgtVbmaTA:28184124" : {
      "node" : "nP9kF78cTJyyRtgtVbmaTA",
      "id" : 28184124,
      "type" : "transport",
      "action" : "indices:data/write/bulk",
      "description" : "requests[125], indices[my_index_syslog]",
      "start_time_in_millis" : 1746058741370,
      "running_time_in_nanos" : 13491482,
      "cancellable" : false,
      "headers" : {
        "X-elastic-product-origin" : "logstash-output-elasticsearch"
      },
      "children" : [
        {
          "node" : "nP9kF78cTJyyRtgtVbmaTA",
          "id" : 28184125,
          "type" : "transport",
          "action" : "indices:data/write/bulk[s]",
          "status" : {
            "phase" : "waiting_on_primary"
          },
          "description" : "requests[125], index[my_index_syslog-000100][0]",
          "start_time_in_millis" : 1746058741370,
          "running_time_in_nanos" : 13291034,
          "cancellable" : false,
          "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184124",
          "headers" : {
            "X-elastic-product-origin" : "logstash-output-elasticsearch"
          },
          "children" : [
            {
              "node" : "fQ5flyTAQjusRvNX_z2e1Q",
              "id" : 30060771,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s][r]",
              "status" : {
                "phase" : "replica"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741384,
              "running_time_in_nanos" : 623658,
              "cancellable" : false,
              "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184125",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              }
            },
            {
              "node" : "nP9kF78cTJyyRtgtVbmaTA",
              "id" : 28184126,
              "type" : "transport",
              "action" : "indices:data/write/bulk[s][p]",
              "status" : {
                "phase" : "primary"
              },
              "description" : "requests[125], index[my_index_syslog-000100][0]",
              "start_time_in_millis" : 1746058741370,
              "running_time_in_nanos" : 13206032,
              "cancellable" : false,
              "parent_task_id" : "nP9kF78cTJyyRtgtVbmaTA:28184125",
              "headers" : {
                "X-elastic-product-origin" : "logstash-output-elasticsearch"
              }
            }
          ]
        }
      ]
    }
  }
}
  3. Nothing unusual has been detected in the Elasticsearch logs; at regular intervals there are log entries indicating that the ILM rollover is ready to proceed because the conditions have been reached.
  4. Here is an extract of the response from GET my_index_syslog-000100/_ilm/explain while the "update_tsdb_data_stream_end_times" pending task is present:
{
  "indices" : {
    "my_index_syslog-000100" : {
      "index" : "my_index_syslog-000100",
      "managed" : true,
      "policy" : "raw_syslog",
      "index_creation_date_millis" : 1746057376892,
      "time_since_index_creation" : "22.73m",
      "lifecycle_date_millis" : 1746057376892,
      "age" : "22.73m",
      "phase" : "hot",
      "phase_time_millis" : 1746057377030,
      "action" : "rollover",
      "action_time_millis" : 1746057377230,
      "step" : "check-rollover-ready",
      "step_time_millis" : 1746057377230,
      "phase_execution" : {
        "policy" : "raw_syslog",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_age" : "20m",
              "max_primary_shard_docs" : 200000000,
              "min_docs" : 1,
              "max_size" : "10gb"
            }
          }
        },
        "version" : 1,
        "modified_date_in_millis" : 1745929943516
      }
    }
  }
}

So we can see here that the age is already higher than the max_age parameter. When the issue occurs, the age reported by _ilm/explain keeps growing indefinitely until either the disk saturates or I restart the Elasticsearch nodes.
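To illustrate, the growth can be observed by polling the explain API periodically, for example (a sketch; the 60-second interval and the localhost:9200 endpoint are assumptions):

# print the index age and current ILM step once per minute
while true; do
  curl -s 'http://localhost:9200/my_index_syslog-000100/_ilm/explain?human' \
    | jq '.indices[] | {index, age, step}'
  sleep 60
done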

Thank you for your help.

Update on the issue:

I've been exploring a potential workaround by upgrading Elasticsearch to version 8.17.7. Since implementing this change, the problem hasn't reoccurred for several days now. I'll continue to monitor and update this ticket if needed.

If anyone has insights into the root cause of this issue, I would greatly appreciate your input.

Hi @bpaoli - I suspect you hit a known issue in Elasticsearch where the master node stops processing tasks. The related issue is Scaling EsExecutors with core size 0 might starve work due to missing workers · Issue #124667 · elastic/elasticsearch · GitHub, and it has been fixed in 8.17.4 (cf. the pull request Prevent starvation bug if using scaling EsThreadPoolExecutor with core pool size = 0 by mosche · Pull Request #124732 · elastic/elasticsearch · GitHub).

One way to identify the problem is to retrieve the list of pending tasks (cf. GET _cluster/pending_tasks?human) and run a command similar to the following:

cat cluster_pending_tasks.json | jq '.tasks[].executing' | sort | uniq -c                                                                                                 
NNN false

If the command returns only a single line of the form NNN false (where NNN is a number), it means there are tasks queued but none is executing, which is (almost certainly) an indication of this problem.
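If it helps, the whole check can be run as one pipeline (assuming the cluster is reachable on localhost:9200 without authentication; adjust host and credentials as needed):

# fetch the pending tasks and count how many are executing vs. queued
curl -s 'http://localhost:9200/_cluster/pending_tasks?human' > cluster_pending_tasks.json
jq '.tasks[].executing' cluster_pending_tasks.json | sort | uniq -c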