Stuck pending tasks

About 10 minutes after setting cluster.routing.allocation.disk.threshold_enabled to true, over 4,000 tasks (see example below) were queued up and appear to be stuck. Roughly once every 30 minutes, only the top one or two tasks get removed from the queue.

Any idea why these tasks are queued up?


Cluster Status:

cluster_name: "search",
status: "yellow",
timed_out: false,
number_of_nodes: 6,
number_of_data_nodes: 6,
active_primary_shards: 54110,
active_shards: 162326,
relocating_shards: 500,
initializing_shards: 0,
unassigned_shards: 4

Task queued up:

insert_order: 204469,
priority: "URGENT",
source: "shard-started ([c52352ec-7d62-436e-9f74-3e4d3aa0a344][4], node[_kuf5-IYQLKA3YXo9fs_DQ], relocating [XaH-gPDTQRWjdP09dfIuqw], [P], s[INITIALIZING]), reason [after recovery (replica) from node [[Some_Node][XaH-gPDTQRWjdPZJauIuqw][inet[/]]]]",
time_in_queue_millis: 10555125,
time_in_queue: "2.9h"
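For anyone debugging a similar backlog, the pending tasks API (GET /_cluster/pending_tasks) returns the full queue. A minimal sketch of summarizing such a response by priority and age, using a hypothetical payload modeled on the task above:

```python
# Summarize a /_cluster/pending_tasks response by priority and queue age.
# The sample payload below is hypothetical, modeled on the task in this thread.
from collections import Counter

response = {
    "tasks": [
        {
            "insert_order": 204469,
            "priority": "URGENT",
            "source": "shard-started ...",
            "time_in_queue_millis": 10555125,
        },
        # ...thousands more entries in a backed-up cluster
    ]
}

def summarize(tasks):
    """Count tasks per priority and report the oldest queue time in hours."""
    by_priority = Counter(t["priority"] for t in tasks)
    oldest_hours = max(t["time_in_queue_millis"] for t in tasks) / 3_600_000
    return by_priority, round(oldest_hours, 1)

counts, oldest = summarize(response["tasks"])
print(counts, oldest)  # oldest is 2.9 hours, matching time_in_queue above
```

Watching how the counts and the oldest age change over a few minutes tells you whether the queue is draining at all or fully wedged.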

That is a VERY large number of shards for a 6-node cluster. I would recommend you reconsider your indexing/sharding strategy to reduce the number of indices and shards considerably.

How much data do you have in the cluster? What is your average shard size?

@Christian_Dahlqvist: You are absolutely right. We are in the process of re-architecting our data models now. We actually have a lot of empty shards. Our total data size is only 3.2GB for this cluster.
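A quick back-of-the-envelope using the figures from this thread shows just how small these shards are, on the order of tens of kilobytes each, while every shard still carries fixed overhead in cluster state and Lucene structures:

```python
# Back-of-the-envelope average shard size using the numbers from this thread.
total_bytes = 3.2 * 1024**3   # 3.2 GB of total data
active_shards = 162_326       # from the cluster health output above

avg_kb = total_bytes / active_shards / 1024
print(f"average shard size: {avg_kb:.1f} KB")  # roughly 20.7 KB per shard
```

With shards this small, the per-shard overhead dominates, which is why the master struggles to keep up with the task queue.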

Any idea how we can get the stuck tasks moved forward?

Wow, you just broke the record for the largest number of shards I have seen :wink:

(On a cluster that small)