We have a cluster with a high number of pending tasks. Here is an example of one:
{
"insert_order" : 17373233,
"priority" : "NORMAL",
"source" : "indices_store ([[logstash-CUSTOMER-PRODUCT-2016.03.18][2]] active fully on other nodes)",
"executing" : false,
"time_in_queue_millis" : 1209415567,
"time_in_queue" : "13.9d"
},
When we ran the following query, we got back about 251 MB of data:
curl -XGET 'http://localhost:9200/_cluster/pending_tasks?pretty=true' > pending_tasks.json
total 515160
drwxr-xr-x 5 vasu staff 170B Apr 26 13:39 .
drwxr-xr-x 7 vasu staff 238B Apr 26 13:34 ..
-rw-r--r-- 1 vasu staff 77K Apr 26 13:38 nodes.json
-rw-r--r-- 1 vasu staff 5.9K Apr 26 13:39 nodes_abbreviated.json
-rw-r--r-- 1 vasu staff 251M Apr 26 13:37 pending_tasks.json
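As a side note, the count alone can be pulled without the full dump - either from cluster health, or from the file we already have if jq is installed (just a sketch):
# Just the pending-task count, without the 251 MB dump
curl -s 'http://localhost:9200/_cluster/health?pretty=true' | grep number_of_pending_tasks
# Or count the entries in the dump we already wrote (assumes jq is available)
jq '.tasks | length' pending_tasks.json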
According to the API docs, the number of pending tasks should generally be ZERO, and in "rare cases" where the master is the bottleneck it can spike - but only for a few thousand milliseconds. We are seeing DAYS for these values.
Something is clearly wrong (or at least it looks that way). Any suggestions on how to go about understanding what is actually happening? Almost all the tasks have the same form as the example above: "indices_store ... active fully on other nodes".
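If it helps, one rough way to tally what kinds of tasks are queued is to group the "source" fields from the dump (a sketch, assuming jq is installed):
# Bucket pending tasks by the first word of their "source" field
jq -r '.tasks[].source' pending_tasks.json | awk '{print $1}' | sort | uniq -c | sort -rn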
A few days ago, our masters started having problems and were falling over. That has since been mitigated. Could these be old pending tasks that were never cleared? (That is our current best guess.)
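One quick sanity check for that guess is to confirm which node is currently the elected master, e.g. via the cat API:
curl -s 'http://localhost:9200/_cat/master?v'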
Our ES version is 2.1.2, and the rest of the relevant information is below.
ec2-user@Elasticsearch:~> curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
"cluster_name" : "Elasticsearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 12,
"number_of_data_nodes" : 9,
"active_primary_shards" : 5023,
"active_shards" : 10046,
"relocating_shards" : 2,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 1165671,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 1504202,
"active_shards_percent_as_number" : 100.0
}
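A simple way to see whether that queue is draining at all would be to poll the cat API periodically (just a sketch; the 10-second interval is arbitrary):
# Watch the head of the pending-task queue
watch -n 10 "curl -s 'http://localhost:9200/_cat/pending_tasks?v' | head -n 20"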
Thanks!