Pending_tasks has millions of entries, many of which are exact duplicates of ilm-move-to-step


I have a fairly large 7.10.2 cluster with 10k indices, and at times we see millions of pending tasks queued up.

I managed to get a dump of them while there were only 2M, and I saw:

  • 90% of them are ilm-move-to-step
  • a given ilm-move-to-step task is repeated more than 2k times, like so:

     20439        1.5h NORMAL   ilm-move-to-step {policy [retention_4d], index [index1234], currentStep [{"phase":"new","action":"complete","name":"complete"}], nextStep [{"phase":"warm","action":"set_priority","name":"set_priority"}]}
     38435        1.5h NORMAL   ilm-move-to-step {policy [retention_4d], index [index1234], currentStep [{"p...
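
For anyone wanting to reproduce the counts, here is a minimal sketch of how the duplicates in such a dump could be tallied. It assumes the dump is the plain-text output of `GET _cat/pending_tasks`, with each line holding insertOrder, timeInQueue, priority and source in that order (which matches the lines above). The function names are hypothetical, and the regular expression is only a guess at the line shape, not an official parser.

```python
import re
from collections import Counter

# Assumed line shape: <insertOrder> <timeInQueue> <priority> <source...>
# This mirrors the sample lines in the post; adjust if your dump differs.
LINE_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\S+)\s+(.*\S)\s*$")


def count_duplicate_sources(lines):
    """Count how often each exact task source appears in the dump."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip blank or unparseable lines
        counts[match.group(4)] += 1
    return counts


def count_task_types(source_counts):
    """Aggregate by task type, i.e. the first word of the source."""
    types = Counter()
    for source, n in source_counts.items():
        types[source.split(" ", 1)[0]] += n
    return types
```

Feeding it the lines of a saved dump and calling `most_common(10)` on each result lists the most repeated exact tasks and the dominant task types, which is how figures like "90% are ilm-move-to-step" and "> 2k repeats" can be checked.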

This cluster has 45 data nodes.
What could be the reason for so many duplicate tasks?

This version is very old, long past EOL. You need to upgrade to a supported version as soon as possible.

There are several known scalability issues in this version that could explain these symptoms, and which are fixed in newer versions. See e.g. the issues linked from here:

Sure, it's old, but upgrading is not so easy. Do you reckon upgrading to 7.17 would bring some benefits? Upgrading to 8 is much harder, but 7.17 we might do...


Yes, 7.17 will give you some of the benefits I linked.

