Hello,
we are using Elasticsearch to create daily indices for our log files, which means that every night at 00:00 UTC all indices for the new day are created.
For the past two days we have been having issues in one of our clusters at exactly that point in time.
Some of the new indices are created but contain no data; the rest are not created at all. The pending task queue also grows significantly, with very few tasks being processed. Cluster health is green the whole time, with 0 shards relocating, initializing, unassigned, or delayed.
On the first day I was able to get ES processing data again by executing:
curl -XPUT localhost:19210/_cluster/settings -d '{ "transient" : { "threadpool.bulk.queue_size" : 1000 } }'
After that all new indices were being created, as well as new data was flowing in again.
The next day the exact same problem occurred. I ran the same command again, this time with a queue size of 1100, and it fixed the issue again.
However, I am not sure whether this is really related to the command, or whether things simply started working again because the command flushed the task queue. I also tried rolling restarts of the master nodes to flush the task queue, but when that flushed the queue, it did not help.
So either it is related to the bulk queue size, or I got lucky twice by flushing the queue at the right moment.
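If it does turn out to be the bulk queue size, one thing I am considering (not yet verified on this cluster) is making the setting permanent in elasticsearch.yml on each data node, since the transient setting above would be lost on a full cluster restart:

```yaml
# elasticsearch.yml (Elasticsearch 2.x setting name) - requires a node restart to take effect
threadpool.bulk.queue_size: 1000
```

I would still prefer to understand the root cause first rather than just raising the limit permanently.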
ES Version: 2.4.1
Logstash Versions: 5.3.0, 2.4.1 running in parallel at the moment
Additional log info:
A very high number of pending tasks, e.g.:
"tasks" : [ { "insert_order" : 183826, "priority" : "URGENT", "source" : "create-index-template [metricbeat], cause [api]", "executing" : true, "time_in_queue_millis" : 5655, "time_in_queue" : "5.6s" }, { "insert_order" : 183830,
and
{ "insert_order" : 183939, "priority" : "HIGH", "source" : "_add_listener_", "executing" : false, "time_in_queue_millis" : 108, "time_in_queue" : "108ms" }
Logstash logs filled with:
[2017-05-27T12:38:44,365][INFO ][logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$4@7cdd3880 on EsThreadPoolExecutor[bulk, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@562cd8c1[Running, pool size = 32, active threads = 32, queued tasks = 1852, completed tasks = 8490655]]"})
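To see whether the bulk queue really fills up right at midnight, I am planning to poll the thread pool stats around 00:00 with something like the following (host and port match the curl command above; this uses the 2.x _cat API):

```shell
# Poll bulk thread pool usage every 5 seconds to watch for queue growth and rejections
while true; do
  curl -s 'localhost:19210/_cat/thread_pool?v&h=host,bulk.active,bulk.queue,bulk.rejected'
  sleep 5
done
```

That should at least show whether the rejections start before or after the index creation tasks pile up.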
Can you help me out here? It looks like increasing the value was only a short-term fix and the problem will reoccur.
Best Regards!