Cluster pending_tasks - what do they mean?

We have a cluster with a very high number of pending tasks. An example is below:

 {
    "insert_order" : 17373233,
    "priority" : "NORMAL",
    "source" : "indices_store ([[logstash-CUSTOMER-PRODUCT-2016.03.18][2]] active fully on other nodes)",
    "executing" : false,
    "time_in_queue_millis" : 1209415567,
    "time_in_queue" : "13.9d"
  },

When we ran the following query, we got back about 251 MB of data!

curl -XGET 'http://localhost:9200/_cluster/pending_tasks?pretty=true' > pending_tasks.json

total 515160
drwxr-xr-x  5 vasu  staff   170B Apr 26 13:39 .
drwxr-xr-x  7 vasu  staff   238B Apr 26 13:34 ..
-rw-r--r--  1 vasu  staff    77K Apr 26 13:38 nodes.json
-rw-r--r--  1 vasu  staff   5.9K Apr 26 13:39 nodes_abbreviated.json
-rw-r--r--  1 vasu  staff   251M Apr 26 13:37 pending_tasks.json
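
In case it helps, rather than opening the 251 MB dump we can summarise it with jq (a quick sketch, assuming jq is installed; the pending tasks API returns a top-level "tasks" array):

# total number of queued tasks
jq '.tasks | length' pending_tasks.json

# count tasks by source type, stripping the per-shard detail
jq -r '.tasks[].source' pending_tasks.json | sed 's/ *(\[\[.*//' | sort | uniq -c | sort -rn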

According to the API docs, this number should generally be zero; in "rare cases" where the master is the bottleneck it can be non-zero, but only for a few thousand milliseconds. We are seeing DAYS for these values.

Something is obviously wrong (or so it seems). Any suggestions on how to go about figuring out what is actually wrong? Almost all of the tasks have the same form as the example above: indices_store ... active fully on other nodes.

A few days ago, our masters started having problems and were falling over. That has since been mitigated. Could these be old pending tasks that never got cleared? (That's our current best guess.)
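
One sanity check we can run (a sketch using the _cat APIs, which should be available on 2.1.2) is to confirm which node is currently the elected master and whether the queue is still growing:

# which node is the elected master right now
curl -XGET 'http://localhost:9200/_cat/master?v'

# human-readable view of the head of the pending task queue
curl -XGET 'http://localhost:9200/_cat/pending_tasks?v' | head -n 20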

Our ES version is 2.1.2, and the rest of the information is below.

ec2-user@Elasticsearch:~> curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "Elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 12,
  "number_of_data_nodes" : 9,
  "active_primary_shards" : 5023,
  "active_shards" : 10046,
  "relocating_shards" : 2,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 1165671,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 1504202,
  "active_shards_percent_as_number" : 100.0
}
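
For context, 10046 active shards spread over 9 data nodes works out to roughly 1,100 shards per node, which is a lot of cluster state for a single master to manage. Something like the following (just a rough check via the _cat/allocation API) shows the per-node shard counts directly:

curl -XGET 'http://localhost:9200/_cat/allocation?v'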

Thanks!

A small update on this: we killed the master (after creating 3 dedicated master nodes). The number of pending tasks during the resulting shard reallocation is as follows:

ec2-user@Elasticsearch:~> curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "Elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 12,
  "number_of_data_nodes" : 9,
  "active_primary_shards" : 5023,
  "active_shards" : 9018,
  "relocating_shards" : 0,
  "initializing_shards" : 2,
  "unassigned_shards" : 1026,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 3,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 117141,
  "active_shards_percent_as_number" : 89.76707147123233
}
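
To keep an eye on the 1026 unassigned shards coming back, we are simply polling the shard table (a quick sketch; assumes the _cat/shards API and standard shell tools):

# count of shards still waiting to be assigned
curl -s -XGET 'http://localhost:9200/_cat/shards' | grep -c UNASSIGNED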

The number of pending tasks is back to normal compared to what it was before. If anyone has any ideas about what could have happened previously, that would be awesome.

Previously we did NOT have dedicated master nodes, as our throughput to ES is really quite small. After 4 months, we have just shy of 5 GB of data in ES.
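
For reference, the three dedicated masters are configured roughly along these lines in elasticsearch.yml (a sketch for ES 2.x; the values shown are only what we would expect for 3 master-eligible nodes):

# on the dedicated master nodes only
node.master: true
node.data: false
# quorum of master-eligible nodes: (3 / 2) + 1 = 2
discovery.zen.minimum_master_nodes: 2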

That's pretty impressive! But also bad :frowning:

What sort of data do you have in the cluster? How many queries per second? How large are your nodes? What sort of tasks are in the queue?