Hi All, I'm running ES version 2.4, and the majority of the nodes are maxing out on CPU for 15-30 minutes before eventually settling back to normal usage (~50% load). When this happens the entire cluster becomes extremely slow, or searches just fail. Looking through the logs, I've found a recurring pattern:
[2020-05-11 15:21:35,003][DEBUG][action.search ] [Madame MacEvil] failed to reduce search
Failed to execute phase [fetch], [reduce]
<a bunch of backtrace lines...>
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.search.SearchQueryThenFetchAsyncAction$2@3a2e6e2e on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@37659bc5[Running, pool size = 13, active threads = 13, queued tasks = 1000, completed tasks = 59285159]]]
    at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
    at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
    at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
    at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
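For what it's worth, the numbers in that exception line up with the defaults: if I'm reading the 2.x thread pool docs right, the search pool is sized to int(cores * 3 / 2) + 1 with a fixed queue of 1000, so an 8-core box gets exactly the 13 threads shown, and once all 13 are busy and 1000 tasks are queued, any further shard-level search task is rejected. A quick sanity check of that arithmetic (the formula is my assumption from the docs, not something in the log):

```python
# Back-of-envelope check of the numbers in the rejection message.
# Assumption: ES 2.x sizes the search thread pool as int(cores * 3 / 2) + 1
# with a fixed queue of 1000 (per the 2.x thread pool docs).
cores = 8
pool_size = (cores * 3) // 2 + 1          # -> 13, matching "pool size = 13"
queue_capacity = 1000                     # matching "queue capacity = 1000"
max_in_flight = pool_size + queue_capacity

# A node can hold this many shard-level search tasks before EsAbortPolicy
# starts rejecting new ones.
print(pool_size, max_in_flight)
```

So the exception isn't a crash per se; it's backpressure saying each node is being asked to do more shard-level search work than it can queue.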
The cluster is 10 nodes: 64 GB, 8-core EC2 instances. The index is spread across 64 shard copies (32 primary, 32 replica), so each node holds 7 or 8 shards. The cluster is update heavy, frequently updating millions of documents in an hour.
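Rough fan-out estimate for that topology (assuming each search queries one copy of every shard and copies are spread evenly, which is my simplification): each search generates 32 shard-level tasks, or about 3 per node, so the 1000-deep queue fills after a few hundred truly concurrent searches per node.

```python
# Assumption: one copy of each of the 32 shards is queried per search,
# and shard copies are evenly distributed across the 10 nodes.
shards_per_search = 32
nodes = 10
tasks_per_node_per_search = shards_per_search / nodes   # ~3.2

pool_size = 13          # from the rejection message
queue_capacity = 1000   # from the rejection message

# Approximate number of concurrent searches per node before rejections start.
saturation_point = (pool_size + queue_capacity) / tasks_per_node_per_search
print(int(saturation_point))
```

That's an upper bound, too; slow shard-level queries (or CPU stolen by the heavy update traffic) keep tasks in the pool longer and pull the real threshold down.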
I'm not sure how to diagnose the issue. Is the problem simply too much load? If so, is there a way to alleviate it without adding more nodes? I just went through the process of recreating the cluster and expanding it with 2 more nodes, and I'm still seeing the issue occur.
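One knob I've seen mentioned (though I haven't tried it yet, and I understand it only masks the underlying load problem): in 2.x the search queue depth is configurable in elasticsearch.yml, something like:

```yaml
# Assumption: 2.x setting name; a deeper queue trades memory and latency
# for fewer rejections, it does not add search capacity.
threadpool.search.queue_size: 2000
```

If the nodes are already CPU-bound, though, a deeper queue would presumably just make the slow periods longer rather than fix them.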