Hi,
Our Elasticsearch cluster gets overloaded from time to time.
We see thread pool rejections on the Elasticsearch side, and Logstash logs errors like the one below:
[logstash.outputs.elasticsearch] retrying failed action with response code: 429 ({"type"=>"es_rejected_execution_exception", "reason"=>"rejected execution of org.elasticsearch.transport.TransportService$7@57757508 on EsThreadPoolExecutor[bulk, queue capacity = 500, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@127e2b61[Running, pool size = 32, active threads = 32, queued tasks = 507, completed tasks = 575009786]]"})
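For reference, the per-node rejection counts also show up in the bulk thread pool stats, e.g. with something along these lines (exact column list may vary by version):
GET _cat/thread_pool/bulk?v&h=node_name,active,queue,rejected,completed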
(Increasing thread_pool.bulk.queue_size to 500 has helped a bit.)
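That setting lives in elasticsearch.yml on each node, i.e. roughly:
thread_pool.bulk.queue_size: 500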
Our cluster holds 26 TB of data on 8 hot, 5 warm and 10 cold nodes. We have ~2000 indices across 6800 primary shards, each replicated once. We are running Elasticsearch 5.5, and each Elasticsearch instance has 30 GB of memory.
The hot nodes have a max of 80 shards each.
Looking at our servers, CPU load and IO are low: CPU is around 15% and IO around 30%, with peaks of 50%. File descriptors / ulimits are also fine.
We are wondering why Elasticsearch isn't using more of the available resources if it is under load / overloaded, and whether there are Elasticsearch settings we could tune to improve performance and get rid of these errors.
Most of the load seems to come from indexing, so we increased indices.memory.index_buffer_size to 30%.
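(That one is also a node-level setting in elasticsearch.yml, set roughly like this:
indices.memory.index_buffer_size: 30%)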
Any tips would be appreciated.
Cheers,
Felix