Hi, I will start by sharing details about our infrastructure.
We are running multiple Community Edition Elasticsearch clusters in parallel, all of which are connected via the "Remote Clusters" feature to a single instance of Kibana.
Each cluster has 3 master nodes and a different number of data nodes.
Most of the clusters work as expected, but one of them (the one I will expand on here) shows severe load average and CPU usage spikes when processing certain queries.
Cluster details:
Running self-managed on AWS (EC2 + EKS)
Logstash instances run inside EKS
3 Master Nodes (m5n.large EC2)
32 Data Nodes (i3en.2xlarge EC2)
Storage is ephemeral, on local SSDs attached to each data node.
Each node has 5TB of ephemeral storage, for a total of 160TB of disk space. We are currently using 80% of it and are adding nodes on a weekly or bi-weekly basis.
Each data node has 64GB of RAM and 8 vCPUs, for a total of 2048GB of memory and 256 vCPUs. Heap size on each data node is set to 27GB.
Sharding: we followed the best practice of keeping shards at or below 50GB; most shards are between 35GB and 50GB.
Indices are rolled over daily or weekly, depending on their size.
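To put the numbers above together, here is a rough back-of-envelope sketch in Python. It assumes decimal units (1TB = 1000GB), that all used disk holds shard data (primaries and replicas alike), and the default search thread pool sizing of int(allocated processors * 3 / 2) + 1. None of these assumptions were checked against the live cluster.

```python
import math

# Figures taken from the cluster details above.
DATA_NODES = 32
STORAGE_PER_NODE_TB = 5
VCPUS_PER_NODE = 8
USED_PERCENT = 80
SHARD_SIZE_RANGE_GB = (35, 50)  # typical shard sizes


def cluster_estimate():
    """Rough capacity and shard-count estimate; every output is approximate."""
    total_tb = DATA_NODES * STORAGE_PER_NODE_TB
    # Assumption: 1 TB = 1000 GB, and all used disk holds shard data.
    used_gb = total_tb * 1000 * USED_PERCENT // 100
    smallest, largest = SHARD_SIZE_RANGE_GB
    shards_low = math.ceil(used_gb / largest)    # if every shard were 50GB
    shards_high = math.ceil(used_gb / smallest)  # if every shard were 35GB
    # Assumption: default search thread pool size, int(processors * 3 / 2) + 1.
    search_threads_per_node = (VCPUS_PER_NODE * 3) // 2 + 1
    return {
        "total_tb": total_tb,
        "used_gb": used_gb,
        "shards_low": shards_low,
        "shards_high": shards_high,
        "shards_per_node_low": shards_low / DATA_NODES,
        "shards_per_node_high": shards_high / DATA_NODES,
        "search_threads_per_node": search_threads_per_node,
        "search_threads_total": search_threads_per_node * DATA_NODES,
    }


if __name__ == "__main__":
    for name, value in cluster_estimate().items():
        print(f"{name}: {value}")
```

With these inputs the estimate is roughly 2,560 to 3,658 shards in total (about 80 to 114 per data node) against 13 search threads per node. The problem query covers only about 15 days of data, so it touches a subset of those shards.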
Our issue is this: when running certain queries or dashboards, load average increases sharply across all of our data nodes:
The image above shows the load average of our nodes rapidly increasing as a result of running ONE query.
Unfortunately I cannot share the query itself because it contains sensitive data, but here are some technical details about it:
The query searches across data from the last month, which at the time of writing amounts to 15 days.
It has around 30 search filters.
Each document searched has around 100 fields on average.
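Since the real query cannot be posted, a shape-only stand-in may help with reasoning about it. The sketch below builds a search body with about 30 filters in filter context plus a 15-day time range, and enables the `profile` option so per-shard timings come back with the response. The field names (`field_0` ... `field_29`, `@timestamp`), the values, and the choice of `term` filters are all placeholders; the real filters may be of other types.

```python
import json

FILTER_COUNT = 30  # "around 30 search filters"


def build_query_shape(filter_count=FILTER_COUNT, days=15):
    """Build a search body with the same rough shape as the real query."""
    # Hypothetical field names and values; the real ones are sensitive.
    filters = [
        {"term": {f"field_{i}": f"value_{i}"}} for i in range(filter_count)
    ]
    # Hypothetical timestamp field; date math covers the last `days` days.
    filters.append({"range": {"@timestamp": {"gte": f"now-{days}d"}}})
    return {
        "profile": True,  # ask Elasticsearch for per-shard timing details
        "query": {"bool": {"filter": filters}},
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into Kibana Dev Tools or curl.
    print(json.dumps(build_query_shape(), indent=2))
```

Running a stand-in like this against the same indices would show whether the load comes from the number of shards touched or from specific filter clauses.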
We have tried many different solutions to this issue and have yet to find one that works.
This causes time-outs on some of the queries, and our clients cannot use the cluster normally.
We did not expect a cluster with this much compute power to choke on a single query. We have exhausted most of our options, which is why we are turning to the forums for help.
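If it helps, these are the standard Elasticsearch diagnostics that can be captured while the query is running: hot threads, running search tasks, and search thread pool stats. The sketch below only builds the request URLs; `BASE_URL` is a placeholder, not our real endpoint, and the parameter values are just reasonable starting points.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:9200"  # hypothetical endpoint, replace as needed


def diagnostic_urls(base_url=BASE_URL):
    """Return URLs for diagnostics worth capturing while the slow query runs."""
    endpoints = [
        # Busiest threads on every node.
        ("/_nodes/hot_threads", {"threads": 5}),
        # Currently running search tasks, with their descriptions.
        ("/_tasks", {"actions": "*search*", "detailed": "true"}),
        # Active, queued and rejected counts for the search thread pool.
        ("/_cat/thread_pool/search",
         {"v": "true", "h": "node_name,active,queue,rejected"}),
    ]
    return [f"{base_url}{path}?{urlencode(params)}" for path, params in endpoints]


if __name__ == "__main__":
    for url in diagnostic_urls():
        print(url)
```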
Please let me know what other information I can provide to help solve this issue, and thanks in advance.