ES2.4.6 - 22 Nodes m4.2xlarge data cluster max cpu and threads

Tim_Curtin · February 27, 2018, 4:06am

Hi All,

Here are some details about an issue we are currently seeing. We are at a loss and looking for help.

We run a 22 data node cluster on AWS EC2, on m4.2xlarge instances
We are not memory bound when our issue occurs (ES gets approx 32GB allocated)
This cluster has 3 clients and 3 masters (on m4.xlarge)
Each data node has 1.5TB of space, we are not near max

We have also seen this issue when we ran 8 nodes of the same type (we scaled out to try to address our issue)

Attached screenshot shows the current issue we are experiencing, whereby:

A query (or queries) hits the cluster, causing max threads, followed by max CPU, followed by max queue
This behaviour lasts for up to 15 minutes and has no pattern (sometimes a few times a day, then not for weeks)
We have analysed our queries to some degree with the profile API, but have not made any headway

Shard sizes: 15-17GB
No of shards: 100 (primary)
No of replica shards: 1 (making 200 total shards)
Document count (total) 3.74billion
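
To put the figures above in per-node terms, here is a rough back-of-the-envelope sketch. It uses only the numbers from this post and assumes shards are spread evenly across the 22 data nodes, which may not match the real allocation.

```python
# Figures taken from the post above.
data_nodes = 22
primary_shards = 100
replicas_per_primary = 1
shard_size_gb = (15, 17)        # reported size range per shard
total_docs = 3_740_000_000      # 3.74 billion

# Assumption: shards are balanced evenly across all data nodes.
total_shards = primary_shards * (1 + replicas_per_primary)
shards_per_node = total_shards / data_nodes
docs_per_primary = total_docs / primary_shards
data_per_node_gb = tuple(size * shards_per_node for size in shard_size_gb)

print(f"total shards:       {total_shards}")
print(f"shards per node:    {shards_per_node:.1f}")
print(f"docs per primary:   {docs_per_primary / 1e6:.1f} million")
print(f"data per node (GB): {data_per_node_gb[0]:.0f}-{data_per_node_gb[1]:.0f}")
```

Under that assumption each node holds about 9 shards and roughly 136-155 GB of data, well below the 1.5TB of disk per node, and a search across the whole index touches one copy of each of the 100 primaries.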

Happy to post any more detail (within reason). This is causing intermittent outages of up to 15 minutes, multiple times per day. There is no discernible reason on the AWS side that we can see that is causing this (i.e., no VM issues).

spinscale · February 27, 2018, 1:03pm

Hey,

Have you seen merges happening during that time? Is I/O going up? Is there anything in the log files of the affected nodes? Have you checked the kernel dmesg output? Did the latency of just pinging those nodes change during that time? What about the hot threads output?

Also, just a word of warning: ES 2.4 is end of life, and we are two major versions ahead with a ton of new features and bug fixes, so upgrading might make sense (even though I understand it may be complicated).

--Alex
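
The hot threads output and the search thread pool state asked about above could be captured during a spike with something like the following. This is a minimal sketch, not a tested monitoring tool: the base URL is a placeholder, the helper names are made up for illustration, and the `search.active`, `search.queue` and `search.rejected` column names are the ones I would expect from `_cat/thread_pool` on 2.x, so check them against `_cat/thread_pool?help` on the cluster first.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, threads=5, interval="500ms"):
    """Build the URL for the nodes hot threads API."""
    query = urlencode({"threads": threads, "interval": interval})
    return f"{base_url}/_nodes/hot_threads?{query}"


def thread_pool_url(base_url):
    """Build the URL for the cat thread pool API, search columns only.

    Assumption: these column names match the 2.x cat API.
    """
    columns = "host,search.active,search.queue,search.rejected"
    return f"{base_url}/_cat/thread_pool?v&h={columns}"


def parse_cat_table(text):
    """Turn whitespace-separated cat output (with a header row) into dicts."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def fetch(url, timeout=10):
    """GET a URL and return the response body as text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def collect(base_url):
    """Fetch both diagnostics once; call this repeatedly while the spike lasts."""
    pools = parse_cat_table(fetch(thread_pool_url(base_url)))
    hot = fetch(hot_threads_url(base_url))
    return pools, hot
```

Calling `collect("http://localhost:9200")` (a hypothetical client node address) every few seconds during an incident and saving the results would show which nodes fill their search queue first and what their busiest threads are doing at that moment.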

morphers82 · February 27, 2018, 1:06pm

m4.2xlarge? is that correct?

system · March 27, 2018, 1:06pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
