Increasing number of pending tasks despite small number of shards


I'm fairly new to Elasticsearch and I'm trying to maintain a small cluster.
Currently I'm having trouble with a growing number of pending tasks. In all the other threads I have looked at, the issue was caused by having a large number of shards. However, I'm quite sure that is not the case here. Here is the output from /_cluster/health:

  "status": "yellow",
  "number_of_nodes": 3,
  "unassigned_shards": 3,
  "number_of_pending_tasks": 2254765,
  "number_of_in_flight_fetch": 0,
  "timed_out": false,
  "active_primary_shards": 218,
  "task_max_waiting_in_queue_millis": 43353576,
  "relocating_shards": 0,
  "active_shards_percent_as_number": 98.66071428571429,
  "active_shards": 221,
  "initializing_shards": 0,
  "number_of_data_nodes": 2,
  "delayed_unassigned_shards": 0

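As a quick sanity check on the numbers above, here is a minimal Python sketch that recomputes the active shard percentage and converts the longest queue wait into hours. The formula for the percentage is my assumption (active shards divided by active plus initializing plus unassigned); it reproduces the reported 98.66 %, and the longest-waiting task turns out to have been queued for roughly 12 hours.

```python
# Values copied from the /_cluster/health output above.
health = {
    "active_shards": 221,
    "initializing_shards": 0,
    "unassigned_shards": 3,
    "number_of_pending_tasks": 2254765,
    "task_max_waiting_in_queue_millis": 43353576,
}

MILLIS_PER_HOUR = 3_600_000


def active_shards_percent(h):
    """Assumed formula: active / (active + initializing + unassigned)."""
    total = h["active_shards"] + h["initializing_shards"] + h["unassigned_shards"]
    return 100.0 * h["active_shards"] / total


def oldest_task_wait_hours(h):
    """Convert the longest time a task has spent in the queue to hours."""
    return h["task_max_waiting_in_queue_millis"] / MILLIS_PER_HOUR


print(f"active shards: {active_shards_percent(health):.2f} %")
print(f"oldest pending task has waited {oldest_task_wait_hours(health):.1f} h")
```
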
We are running without replicas except for a few select Elasticsearch system indices. The ILM is set to 20 GB or 30 days, with deletion after 60 days.
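
For reference, this is roughly what such a policy looks like when written out. It is only a sketch: I am assuming that "20 GB or 30 days" refers to the rollover conditions in the hot phase and that "60 days" is the min_age of the delete phase. The function name is made up for illustration; the resulting body would be sent to PUT _ilm/policy/<policy name>.

```python
import json


def build_ilm_policy(max_size="20gb", max_age="30d", delete_after="60d"):
    """Build an ILM policy body: roll over on size or age, delete later.

    Note: once an index has rolled over, the min_age of later phases is
    measured from the rollover time, not from index creation.
    """
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {"max_size": max_size, "max_age": max_age}
                    }
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


print(json.dumps(build_ilm_policy(), indent=2))
```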

From what I understand this should not in any way be able to cause the issue that we are seeing.
The status is yellow because the periodic snapshot is failing. This might be the cause or at least have the same root cause.

The snapshot supposedly fails because there are 279 shard failures (primary shard is not allocated), but from what I can see this is not true.

We did downgrade the "hot" node around the time when the snapshot issue started. Some days later the node restarted because it ran out of memory, at which point the shards were in fact unavailable.

We tried to upgrade the node again, but it failed. However, every time we tried, the number of unassigned shards went down according to the overview, though not according to the snapshot menu.
(Side question: I couldn't find a way to get the shards assigned without attempting to change the config. Can anyone tell me how I could have done it?)
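
For what it's worth, two documented APIs look relevant to this side question: the allocation explain API, which reports why a shard is unassigned, and the cluster reroute API with retry_failed=true, which retries shards whose allocation failed too many times. Whether either would have helped in this case is an assumption on my part. The sketch below only builds the requests and does not send them; the base URL is a placeholder and a cloud deployment would also need authentication.

```python
import urllib.request


def allocation_explain_request(base_url):
    """GET with no body: explains the first unassigned shard the cluster finds."""
    return urllib.request.Request(
        f"{base_url}/_cluster/allocation/explain", method="GET"
    )


def retry_failed_request(base_url):
    """POST that asks the master to retry allocations that previously failed."""
    return urllib.request.Request(
        f"{base_url}/_cluster/reroute?retry_failed=true", method="POST"
    )


# Placeholder endpoint, not the real cluster address.
BASE_URL = "http://localhost:9200"

for req in (allocation_explain_request(BASE_URL), retry_failed_request(BASE_URL)):
    print(req.get_method(), req.full_url)
```

To actually run a request, pass it to urllib.request.urlopen once the URL and credentials are set.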

It seems that the cluster has ended up in a weird state where everything seems to be working except for the snapshot and the growing number of tasks.

Please help me learn and figure out how to fix this.

I had to restart the cluster yesterday because the number of tasks had increased to more than 10M over the weekend and one of the nodes was running out of memory. Naturally this cleared the pending tasks, but now, less than a day later, it is at 2.5M again.
I can't figure out what task is blocking, but it might be a template creation. At least the ES log is being spammed with this message multiple times a second:
[instance-0000000026] adding template [MyIndex] for index patterns [MyIndex-*]
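
One way to check which kind of task dominates the queue is to group the output of GET /_cluster/pending_tasks by its source field. Below is a minimal sketch of that grouping. The sample response is made up for illustration (the field names follow the pending tasks API, the source strings are only examples), and summarize_pending_tasks is a hypothetical helper, not part of any client library.

```python
import re
from collections import Counter


def summarize_pending_tasks(response):
    """Count pending tasks per kind, ignoring the names inside [brackets]."""
    counts = Counter()
    for task in response.get("tasks", []):
        source = task.get("source", "unknown")
        # Replace bracketed names so tasks of the same kind group together.
        kind = re.sub(r"\[[^\]]*\]", "[...]", source)
        counts[kind] += 1
    return counts.most_common()


# Illustrative sample only, not real output from this cluster.
sample = {
    "tasks": [
        {"insert_order": 101, "priority": "URGENT",
         "source": "create-index-template [MyIndex], cause [api]",
         "time_in_queue_millis": 86},
        {"insert_order": 102, "priority": "URGENT",
         "source": "create-index-template [MyIndex], cause [api]",
         "time_in_queue_millis": 70},
        {"insert_order": 103, "priority": "HIGH",
         "source": "update-settings [MyIndex-000001]",
         "time_in_queue_millis": 12},
    ]
}

for kind, count in summarize_pending_tasks(sample):
    print(count, kind)
```

With millions of queued tasks the real response will be very large, so it is probably best to save it to a file first and run the summary on that.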

Which version of Elasticsearch are you using? What is the specification of the cluster with respect to hardware? What type of storage are you using? Local SSDs?

We are using Elasticsearch 7.5 and running in the cloud with 3 nodes (2 data nodes) all in the same zone:

  • aws.coordinating.m5 - up to 4 vCPU - 8 GB RAM - 32 GB disk
  • - 3.8 vCPU - 29 GB RAM - 870 GB disk (shows as 928 GB with 370 GB used)
  • - 3.8 vCPU - 29 GB RAM - 4.53 TB disk (shows as 4.59 TB with 1.4 TB used)
