ES 2.4.6 - 22-node m4.2xlarge data cluster hitting max CPU and threads

Hi All,

Here are some details about an issue we are currently seeing. We are at a loss and looking for help.

  • We run a cluster of 22 data nodes on AWS EC2, on m4.2xlarge instances
  • We are not memory bound when our issue occurs (ES gets approx 32GB allocated)
  • This cluster has 3 clients and 3 masters (on m4.xlarge)
  • Each data node has 1.5TB of space, we are not near max

We have also seen this issue when we ran 8 nodes of the same type (we scaled out to try to address our issue)

Attached screenshot shows the current issue we are experiencing, whereby:

  • A query (or queries) hits the cluster, causing max threads, followed by max CPU, followed by max queue
  • This behaviour lasts for up to 15 minutes and has no pattern (sometimes a few times a day, then not for weeks)
  • We have analysed our queries to some degree with the profile API, but have not made any headway

Shard sizes: 15-17GB
Number of primary shards: 100
Number of replicas: 1 (making 200 shards in total)
Document count (total): 3.74 billion
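
For anyone sizing this up, the figures above can be turned into per-node numbers with a quick calculation. This is only a rough sketch: it assumes shards are spread evenly across the 22 data nodes and that the 3.74 billion document count covers primaries only, neither of which is confirmed here. The function name `shard_layout` is made up for illustration.

```python
def shard_layout(primaries, replicas, data_nodes, shard_gb_range, total_docs):
    """Rough per-node figures, assuming an even spread of shards (an assumption)."""
    total_shards = primaries * (1 + replicas)
    low_gb, high_gb = shard_gb_range
    return {
        "total_shards": total_shards,
        "shards_per_node": total_shards / data_nodes,
        # Assumes the document count covers primary shards only.
        "docs_per_primary": total_docs / primaries,
        "data_gb_per_node": (
            total_shards * low_gb / data_nodes,
            total_shards * high_gb / data_nodes,
        ),
    }


# Figures taken from the post: 100 primaries, 1 replica, 22 data nodes,
# 15-17GB per shard, 3.74 billion documents.
layout = shard_layout(100, 1, 22, (15, 17), 3_740_000_000)
for name, value in layout.items():
    print(name, value)
```

Under those assumptions each data node holds about nine shards, well below the 1.5TB of disk per node, which fits the statement that disk space is not the constraint.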

Happy to post more detail (within reason). This is causing intermittent outages of up to 15 minutes, multiple times per day. We can see no discernible cause on the AWS side (i.e., no VM issues).

Hey,

Have you seen merges happening during that time? Is I/O going up? Is there anything in the log files of the affected nodes? Have you checked the kernel dmesg output? Did the latency of just pinging those nodes change during that time? What about the hot threads output?
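
A minimal sketch of how that output could be collected while a spike is in progress, assuming a node is reachable over HTTP at `http://localhost:9200` without authentication (both are assumptions, adjust for your setup). The helper names `diagnostic_urls` and `fetch` are made up for illustration; the endpoints are the standard hot threads, cat thread pool and node stats APIs.

```python
import urllib.request

# Standard Elasticsearch endpoints; which ones are most useful here is a guess.
DIAGNOSTIC_PATHS = {
    "hot_threads": "/_nodes/hot_threads",
    "thread_pools": "/_cat/thread_pool?v",
    # Node stats limited to index-level stats (includes merges), thread pools and OS.
    "node_stats": "/_nodes/stats/indices,thread_pool,os",
}


def diagnostic_urls(base_url):
    """Build the full URL for each diagnostic endpoint."""
    base = base_url.rstrip("/")
    return {name: base + path for name, path in DIAGNOSTIC_PATHS.items()}


def fetch(url, timeout=10):
    """Return the raw response body as text (hot threads output is plain text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    # Only prints the URLs; call fetch(url) against a live cluster to get the output.
    for name, url in diagnostic_urls("http://localhost:9200").items():
        print(name, url)
```

Capturing these a few times during an incident and once when the cluster is quiet gives something to compare against.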

Also, just a word of warning: ES 2.4 is end of life, and we are two major versions ahead with a ton of new features and bugfixes, so upgrading might make sense (even though I understand it may be complicated).

--Alex

m4.2xlarge? Is that correct?
