Receiving the following error in the Elasticsearch cluster logs:
Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 1000) on org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler
Was able to notice the following during one of the times a search query was hung
We have an 11-node ELK cluster with 3 master nodes, 5 data nodes and 3 Logstash nodes.
Data nodes are configured as follows:
- 64GB RAM (32GB allocated to ES_HEAP_SIZE)
- 1.4TB iSCSI volume with 700GB used
- Swap is disabled
Well, it basically means that you've got 1000 search requests queued up waiting to run, and once that limit is reached ES starts rejecting new requests.
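To make the mechanism concrete, here is a toy model of a bounded queue that rejects work once it is full. It is only an illustration in plain Python, not how ES implements its thread pools; `submit_all` and the request count of 1200 are made up for the example, and only the capacity of 1000 comes from the error message.

```python
import queue

QUEUE_CAPACITY = 1000  # matches "queue capacity 1000" in the error message


def submit_all(pending, n_requests):
    """Try to enqueue n_requests; return how many were rejected."""
    rejected = 0
    for request_id in range(n_requests):
        try:
            # put_nowait raises queue.Full instead of blocking,
            # which plays the role of a rejected execution here.
            pending.put_nowait(request_id)
        except queue.Full:
            rejected += 1
    return rejected


# Nothing is draining the queue in this toy model, so every request
# past the capacity is rejected.
pending = queue.Queue(maxsize=QUEUE_CAPACITY)
rejected = submit_all(pending, 1200)
print("queued:", pending.qsize(), "rejected:", rejected)
```

In a real cluster the search threads keep draining the queue, so rejections only show up when requests arrive faster than they complete for long enough to fill it.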
So you'll need to figure out the bottleneck. Some options:
- Your clients are simply sending too many queries too quickly in a fast burst, overwhelming the queue. You can monitor this with Node Stats over time to see if it's bursty or smooth
- You've got some very slow queries which get "stuck" for a long time, eating up threads and causing the queue to back up. You can enable the slow log to see if there are queries that are taking an exceptionally long time, then try to tune those
- There may be "unending" scripts written in Groovy or something. E.g. a loop that never exits, causing the thread to spin forever.
- Your hardware may be under-provisioned for your workload, and bottlenecking on some resource (disk, CPU, etc.)
- A temporary hiccup from your iSCSI target, which causes all the in-flight operations to block waiting for the disks to come back. It wouldn't take a big latency hiccup to seriously back up a busy cluster... ES generally expects disks to always be available.
- Heavy garbage collections could cause problems too. Check Node Stats to see if there are many/long old gen GCs running
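For the Node Stats checks mentioned above, something like the following rough sketch could help. It assumes a node is reachable at `http://localhost:9200` (adjust for your cluster) and that the response has the usual `thread_pool.search` and `jvm.gc.collectors.old` sections. The helper names (`fetch_node_stats`, `summarize`, `rejected_delta`) and the sample numbers are made up for illustration.

```python
import json
import urllib.request


def fetch_node_stats(base_url="http://localhost:9200"):
    """Fetch thread pool and JVM stats for all nodes (base_url is an assumption)."""
    with urllib.request.urlopen(base_url + "/_nodes/stats/thread_pool,jvm") as resp:
        return json.load(resp)


def summarize(stats):
    """Pull out search queue depth, rejections and old gen GC totals per node."""
    rows = []
    for node in stats.get("nodes", {}).values():
        search = node.get("thread_pool", {}).get("search", {})
        old_gc = (node.get("jvm", {}).get("gc", {})
                  .get("collectors", {}).get("old", {}))
        rows.append({
            "name": node.get("name"),
            "search_queue": search.get("queue", 0),
            "search_rejected": search.get("rejected", 0),
            "old_gc_count": old_gc.get("collection_count", 0),
            "old_gc_millis": old_gc.get("collection_time_in_millis", 0),
        })
    return rows


def rejected_delta(before, after):
    """Rejections per node between two summaries (the counters are cumulative)."""
    previous = {row["name"]: row["search_rejected"] for row in before}
    return {row["name"]: row["search_rejected"] - previous.get(row["name"], 0)
            for row in after}


# Made-up sample in the shape of a Node Stats response, so the sketch
# runs without a cluster.
sample_before = {"nodes": {"abc": {
    "name": "data-1",
    "thread_pool": {"search": {"queue": 12, "rejected": 40}},
    "jvm": {"gc": {"collectors": {"old": {
        "collection_count": 3, "collection_time_in_millis": 900}}}},
}}}
sample_after = {"nodes": {"abc": {
    "name": "data-1",
    "thread_pool": {"search": {"queue": 1000, "rejected": 65}},
    "jvm": {"gc": {"collectors": {"old": {
        "collection_count": 4, "collection_time_in_millis": 1500}}}},
}}}

before = summarize(sample_before)
after = summarize(sample_after)
print(after)
print(rejected_delta(before, after))
```

Against a live cluster you would call `summarize(fetch_node_stats())` every minute or so and compare successive results. Deltas that spike now and then point at bursty clients, while a steady climb alongside growing old gen GC time points more towards slow queries or memory pressure.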