Hi All, I'm running ES version 2.4, and the majority of the nodes are maxing out on CPU for 15-30 minutes before eventually settling back to normal usage (~50% load). When this happens the entire cluster becomes extremely slow, or searches just fail. Looking through the logs, I've found a recurring pattern:
[2020-05-11 15:21:35,003][DEBUG][action.search ] [Madame MacEvil] failed to reduce search
Failed to execute phase [fetch], [reduce]
<a bunch of backtrace lines...>
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction$2@3a2e6e2e on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@37659bc5[Running, pool size = 13, active threads = 13, queued tasks = 1000, completed tasks = 59285159]]]
    at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
    at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
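For what it's worth, the numbers in that exception line up with the defaults: if I'm reading the 2.x thread pool docs right, the search pool is sized to int(cores * 3 / 2) + 1 with a fixed queue of 1000, so an 8-core box gets exactly the 13 threads shown, and once all 13 are busy and 1000 tasks are queued, any further shard-level search task is rejected. A quick sanity check of that arithmetic (the formula is my assumption from the docs, not something in the log):

```python
# Back-of-envelope check of the numbers in the rejection message.
# Assumption: ES 2.x sizes the search thread pool as int(cores * 3 / 2) + 1
# with a fixed queue of 1000 (per the 2.x thread pool docs).
cores = 8
pool_size = (cores * 3) // 2 + 1          # -> 13, matching "pool size = 13"
queue_capacity = 1000                     # matching "queue capacity = 1000"
max_in_flight = pool_size + queue_capacity

# A node can hold this many shard-level search tasks before EsAbortPolicy
# starts rejecting new ones.
print(pool_size, max_in_flight)
```

So the exception isn't a crash per se; it's backpressure saying each node is being asked to do more shard-level search work than it can queue.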
The cluster is 10 nodes: 64 GB, 8-core EC2 instances. The index is spread across 64 shard copies (32 primary, 32 replica), so each node holds 7 or 8 shards. The cluster is update heavy, frequently updating millions of documents in an hour.
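Rough fan-out estimate for that topology (assuming each search queries one copy of every shard and copies are spread evenly, which is my simplification): each search generates 32 shard-level tasks, or about 3 per node, so the 1000-deep queue fills after a few hundred truly concurrent searches per node.

```python
# Assumption: one copy of each of the 32 shards is queried per search,
# and shard copies are evenly distributed across the 10 nodes.
shards_per_search = 32
nodes = 10
tasks_per_node_per_search = shards_per_search / nodes   # ~3.2

pool_size = 13          # from the rejection message
queue_capacity = 1000   # from the rejection message

# Approximate number of concurrent searches per node before rejections start.
saturation_point = (pool_size + queue_capacity) / tasks_per_node_per_search
print(int(saturation_point))
```

That's an upper bound, too; slow shard-level queries (or CPU stolen by the heavy update traffic) keep tasks in the pool longer and pull the real threshold down.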
I'm not sure how to diagnose the issue. Is the problem simply too much load? If so, is there a way to alleviate it without adding more nodes? I just went through the process of recreating the cluster and expanding it with 2 more nodes, and I'm still seeing the issue occur.
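One knob I've seen mentioned (though I haven't tried it yet, and I understand it only masks the underlying load problem): in 2.x the search queue depth is configurable in elasticsearch.yml, something like:

```yaml
# Assumption: 2.x setting name; a deeper queue trades memory and latency
# for fewer rejections, it does not add search capacity.
threadpool.search.queue_size: 2000
```

If the nodes are already CPU-bound, though, a deeper queue would presumably just make the slow periods longer rather than fix them.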