We are experiencing poor search performance in our Elasticsearch cluster running on Kubernetes. We have a 5-node ES cluster with the following specifications per node:
ES_JAVA_OPTS: -Xms4096m -Xmx4096m
CPU request: 500m
CPU limit: 8
Memory request: 8Gi
Memory limit: 8Gi
Our cluster is running on AWS EC2 i3.4xlarge instances which provide:
Networking Performance: Up to 10 Gigabit
Storage: 2 x 1.9 TB NVMe SSD.
We use index lifecycle policies to roll over when an index reaches 125 GiB (we create indices with 5 primary shards, following the best practice of keeping shards at a reasonable size).
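As a quick sanity check on that sizing choice, the rollover threshold divided by the shard count gives the target size per primary shard (a sketch with the numbers from above; the 10-50 GiB range is the commonly cited guidance, not a hard limit):

```python
# Sanity check: shard size implied by the ILM rollover policy.
ROLLOVER_GIB = 125      # rollover threshold per index
PRIMARY_SHARDS = 5      # primary shards per index

shard_size_gib = ROLLOVER_GIB / PRIMARY_SHARDS
print(shard_size_gib)   # 25.0

# Commonly cited guidance keeps shards roughly between 10 and 50 GiB.
assert 10 <= shard_size_gib <= 50
```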
We ingest around 125 GiB per day and keep about 2 TB of data stored. The data is fairly evenly balanced across the nodes:
shards disk.indices node
18     373.7gb      node-0
17     463.6gb      node-1
18     417.4gb      node-2
18     311.1gb      node-3
18     326.8gb      node-4
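One thing these numbers suggest is how little of the data can sit in the OS filesystem cache, which search performance depends on heavily. A back-of-the-envelope calculation, assuming the memory left after the 4 GiB heap is all available to the page cache (it ignores other overhead):

```python
# Back-of-the-envelope: filesystem cache available vs. data held per node.
MEM_LIMIT_GIB = 8.0     # Kubernetes memory limit
HEAP_GIB = 4.0          # -Xms4096m / -Xmx4096m

page_cache_gib = MEM_LIMIT_GIB - HEAP_GIB
data_gib = 463.6        # node-1, the fullest node above

coverage = page_cache_gib / data_gib
print(f"{coverage:.1%}")  # ~0.9% of that node's data fits in cache
```

So nearly every search over older data has to hit disk, which makes wide time-range queries expensive.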
All nodes have the same roles (master-eligible data nodes).
When searching the last 7 days of data, the query takes around 1 minute. However, when trying to retrieve the last 15 days of data, we get a timeout after 2 to 4 minutes (I don't understand why the time until timeout varies, since the timeout configuration is always the same).
How can we improve the performance? Which metrics should we look at to understand why our queries are slow?
The resource usage when performing the queries is as follows:
The nodes do not exceed 5 CPUs of usage, even though more CPU is available on the instance (the limit is 8).
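One place worth looking is the search thread pool (`GET _cat/thread_pool/search?v`): if `active` is pinned at the pool size while `queue` grows or `rejected` is non-zero, searches are queueing behind a fixed number of threads rather than being CPU-starved. A hedged sketch of reading that output; the sample text below is illustrative, not real output from this cluster:

```python
# Sketch: spot search thread-pool saturation from _cat/thread_pool output.
# SAMPLE is made-up illustrative text, not real output from the cluster.
SAMPLE = """\
node_name active queue rejected
node-0    13     48    112
node-1    13     51    97
node-2    13     45    130
node-3    13     50    88
node-4    13     47    101
"""

def parse_cat(output):
    """Parse whitespace-separated _cat output into a list of row dicts."""
    lines = output.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

# The default search pool size is int(processors * 3 / 2) + 1, i.e. 13
# threads with 8 detected CPUs, so `active` stuck at 13 with a non-empty
# queue (or any rejections) points at thread saturation, not lack of CPU.
for row in parse_cat(SAMPLE):
    saturated = int(row["queue"]) > 0 or int(row["rejected"]) > 0
    print(row["node_name"], saturated)
```

It may also be worth checking how many processors Elasticsearch actually detected inside the container (`GET _nodes/os`), since thread pools are sized from that number and a low value would cap concurrency regardless of the instance's CPUs.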