Hi All,
We are at a loss with an issue we are currently seeing and are looking for help. Some details:
- We run a 22 data node cluster on AWS EC2, on m4.2xlarge instances
- We are not memory bound when our issue occurs (ES gets approx 32GB allocated)
- This cluster has 3 clients and 3 masters (on m4.xlarge)
- Each data node has 1.5TB of space, we are not near max
We have also seen this issue when we ran 8 nodes of the same type (we scaled out to try to address our issue)
Attached screenshot shows the current issue we are experiencing, whereby:
- A query (or queries) hits the cluster, causing max threads, followed by max CPU, followed by max queue
- This behaviour lasts for up to 15 minutes and has no pattern (sometimes a few times a day, then not for weeks)
- We have analysed our queries to some degree with the Profile API, but have not made any headway
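
For anyone who wants to see the same numbers we are looking at, here is a minimal sketch of how the search thread pool and hot threads could be captured while an incident is in progress. It is only an illustration under some assumptions: the `_cat/thread_pool` column names are the ones used by Elasticsearch 5.0 and later, `ES_URL` is a placeholder address, and the helper names (`fetch_search_pool`, `saturated_nodes`, `fetch_hot_threads`) are made up for this post.

```python
import json
import urllib.request

# Assumption: placeholder address; point this at one of the client nodes.
ES_URL = "http://localhost:9200"


def fetch_search_pool(base_url=ES_URL):
    """Fetch per-node search thread pool stats as a list of dicts.

    Assumes the ES 5.0+ form of the _cat/thread_pool API and column names.
    """
    url = (base_url + "/_cat/thread_pool/search"
           "?format=json&h=node_name,active,queue,rejected")
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def saturated_nodes(rows, queue_threshold=1):
    """Return the names of nodes whose search queue is at or above the threshold.

    The _cat API returns numbers as strings, so they are converted here.
    """
    return [r["node_name"] for r in rows if int(r["queue"]) >= queue_threshold]


def fetch_hot_threads(base_url=ES_URL):
    """Return the plain-text hot threads report for all nodes."""
    with urllib.request.urlopen(base_url + "/_nodes/hot_threads", timeout=10) as resp:
        return resp.read().decode("utf-8")
```

Run on a short interval, `saturated_nodes(fetch_search_pool())` would show whether the queue builds up on every data node at once or only on a few, and `fetch_hot_threads()` taken at the same moment would show what the busy search threads are doing.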
Shard sizes: 15-17GB
No. of shards: 100 (primary)
No. of replicas: 1 (making 200 total shards)
Document count (total): 3.74 billion
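
To put those figures in per-node terms, here is a rough back-of-the-envelope calculation. It is a sketch under stated assumptions, not a measurement: shards are assumed to be spread evenly over the 22 data nodes, the 3.74 billion count is assumed to cover primaries only, a query is assumed to touch one copy of every primary shard, and the search thread pool size uses the documented default for Elasticsearch 5.x and later with the 8 vCPUs of an m4.2xlarge.

```python
# Figures taken from this post
data_nodes = 22
primary_shards = 100
replicas = 1
shard_gb_min, shard_gb_max = 15, 17
total_docs = 3_740_000_000  # assumption: counts primary shards only

# Assumption: m4.2xlarge exposes 8 vCPUs and the thread pool defaults are unchanged
vcpus = 8

total_shards = primary_shards * (1 + replicas)          # 200 shard copies
shards_per_node = total_shards / data_nodes             # about 9.1 if evenly spread
gb_per_node_min = total_shards * shard_gb_min / data_nodes
gb_per_node_max = total_shards * shard_gb_max / data_nodes
docs_per_primary = total_docs / primary_shards          # 37.4 million

# Default size of the search thread pool in ES 5.x+: int(processors * 3 / 2) + 1
search_threads_per_node = int(vcpus * 3 / 2) + 1

# Assumption: one query fans out to one copy of each of the 100 primary shards
shard_requests_per_query_per_node = primary_shards / data_nodes

print(f"shards per node:        {shards_per_node:.1f}")
print(f"data per node (GB):     {gb_per_node_min:.0f} to {gb_per_node_max:.0f}")
print(f"docs per primary shard: {docs_per_primary:,.0f}")
print(f"search threads/node:    {search_threads_per_node}")
print(f"shard requests per query per node: {shard_requests_per_query_per_node:.1f}")
```

If those assumptions hold, each query that spans the whole index occupies roughly 4 to 5 search threads per data node, so only a handful of slow concurrent queries would be needed to use up every search thread and start filling the queue.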
Happy to post any more detail (within reason). This is causing intermittent outages of up to 15 minutes, multiple times per day. We can see no discernible cause on the AWS side (i.e., no VM issues).