Hi All, I'm running ES version 2.4, and the majority of the nodes are maxing out on CPU for 15-30 minutes before eventually settling back to normal usage (~50% load). When this happens the entire cluster becomes extremely slow, or searches just fail. Looking through the logs, I've found a recurring pattern:
[2020-05-11 15:21:35,003][DEBUG][action.search ] [Madame MacEvil] failed to reduce search
Failed to execute phase [fetch], [reduce]
<a bunch of backtrace lines...>
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction$2@3a2e6e2e on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@37659bc5[Running, pool size = 13, active threads = 13, queued tasks = 1000, completed tasks = 59285159]]]
    at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
    at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
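For what it's worth, the numbers in that exception line up with the defaults: if I'm reading the 2.x thread pool docs right, the search pool is sized to int(cores * 3 / 2) + 1 with a fixed queue of 1000, so an 8-core box gets exactly the 13 threads shown, and once all 13 are busy and 1000 tasks are queued, any further shard-level search task is rejected. A quick sanity check of that arithmetic (the formula is my assumption from the docs, not something in the log):

```python
# Back-of-envelope check of the numbers in the rejection message.
# Assumption: ES 2.x sizes the search thread pool as int(cores * 3 / 2) + 1
# with a fixed queue of 1000 (per the 2.x thread pool docs).
cores = 8
pool_size = (cores * 3) // 2 + 1          # -> 13, matching "pool size = 13"
queue_capacity = 1000                     # matching "queue capacity = 1000"
max_in_flight = pool_size + queue_capacity

# A node can hold this many shard-level search tasks before EsAbortPolicy
# starts rejecting new ones.
print(pool_size, max_in_flight)
```

So the exception isn't a crash per se; it's backpressure saying each node is being asked to do more shard-level search work than it can queue.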
The cluster is 10 nodes: 64 GB, 8-core EC2 instances. The index is spread across 64 shard copies (32 primary, 32 replica), so each node holds 7 or 8 shards. The cluster is update heavy, frequently updating millions of documents in an hour.
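Rough fan-out estimate for that topology (assuming each search queries one copy of every shard and copies are spread evenly, which is my simplification): each search generates 32 shard-level tasks, or about 3 per node, so the 1000-deep queue fills after a few hundred truly concurrent searches per node.

```python
# Assumption: one copy of each of the 32 shards is queried per search,
# and shard copies are evenly distributed across the 10 nodes.
shards_per_search = 32
nodes = 10
tasks_per_node_per_search = shards_per_search / nodes   # ~3.2

pool_size = 13          # from the rejection message
queue_capacity = 1000   # from the rejection message

# Approximate number of concurrent searches per node before rejections start.
saturation_point = (pool_size + queue_capacity) / tasks_per_node_per_search
print(int(saturation_point))
```

That's an upper bound, too; slow shard-level queries (or CPU stolen by the heavy update traffic) keep tasks in the pool longer and pull the real threshold down.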
I'm not sure how to diagnose the issue. Is the problem simply too much load? If so, is there a way to alleviate it without adding more nodes? I just went through the process of recreating the cluster and expanding it with 2 more nodes, and I'm still seeing the issue occur.
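One knob I've seen mentioned (though I haven't tried it yet, and I understand it only masks the underlying load problem): in 2.x the search queue depth is configurable in elasticsearch.yml, something like:

```yaml
# Assumption: 2.x setting name; a deeper queue trades memory and latency
# for fewer rejections, it does not add search capacity.
threadpool.search.queue_size: 2000
```

If the nodes are already CPU-bound, though, a deeper queue would presumably just make the slow periods longer rather than fix them.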