Elasticsearch timeout for search query


Elasticsearch version 7.6.1

We have a total of 55 indices with 228 shards and 4.8 TB of disk space.

Our indexing rate ranges from 3,000 to 8,000 docs per second, with a total of about 250 million docs arriving per day.
We have 10 data nodes, each with 2 cores and 5 GB of RAM (50% of it allocated to heap), and around 4.5 billion documents in total at the moment.

Here's our heap usage over the past week (max heap is 25 GB):

When I run a query to get all the data (4.8 billion docs), it sometimes succeeds and sometimes fails, irrespective of how much indexing is happening at the time.
CPU utilization on all the nodes reaches 100% while the query is running.

No one is running queries on the cluster except me.
The search thread pool queue count doesn't exceed 60 while the query is running.
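
For reference, a queue figure like this can be read from the `_cat/thread_pool` API. Below is a minimal Python sketch, assuming the cluster is reachable at a hypothetical `http://localhost:9200` without authentication; the helper names are made up for illustration.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: hypothetical address, no auth


def thread_pool_url(base_url=ES_URL):
    """URL for the search thread pool stats, one row per node, as JSON."""
    return (base_url.rstrip("/")
            + "/_cat/thread_pool/search"
            + "?format=json&h=node_name,name,active,queue,rejected")


def max_queue(cat_json_text):
    """Return (node_name, queue) for the node with the deepest search queue."""
    rows = json.loads(cat_json_text)
    # _cat returns numeric columns as strings, so convert before comparing
    busiest = max(rows, key=lambda r: int(r["queue"]))
    return busiest["node_name"], int(busiest["queue"])


def fetch_max_queue(base_url=ES_URL):
    """Query the live cluster (requires network access to a node)."""
    with urllib.request.urlopen(thread_pool_url(base_url), timeout=10) as resp:
        return max_queue(resp.read().decode("utf-8"))
```

Polling `fetch_max_queue()` every few seconds while the query runs, and also watching the `rejected` column, would show whether searches are being queued or dropped.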

Even though the official docs insist on setting the heap to 50% of available RAM, it seems we are not using most of the heap available. Do you think search would improve if I decreased the heap size to 1.5 GB?
Or is there another way to improve search performance?
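
To put numbers on the heap question, here is a rough arithmetic sketch in Python using the figures from this post (10 nodes, 5 GB of RAM each, 50% heap). It only shows how much RAM each setting leaves outside the JVM heap, which is what the operating system can use for the filesystem cache; it does not predict search performance.

```python
NODES = 10             # data nodes, from the post
RAM_GB_PER_NODE = 5.0  # RAM per node, from the post


def memory_split(heap_gb, ram_gb=RAM_GB_PER_NODE, nodes=NODES):
    """Split RAM into JVM heap and the remainder, per node and cluster-wide."""
    # The remainder is an upper bound on filesystem cache: JVM overhead and
    # the OS itself also live in this part.
    off_heap = ram_gb - heap_gb
    return {
        "heap_per_node_gb": heap_gb,
        "off_heap_per_node_gb": off_heap,
        "cluster_heap_gb": heap_gb * nodes,
        "cluster_off_heap_gb": off_heap * nodes,
    }


current = memory_split(heap_gb=RAM_GB_PER_NODE * 0.5)  # 50% heap, as configured
proposed = memory_split(heap_gb=1.5)                   # the 1.5 GB idea

print("current: ", current)
print("proposed:", proposed)
```

With the current setting the cluster-wide heap comes to 25 GB, matching the max heap mentioned above; dropping to 1.5 GB per node would move 1 GB per node from heap to the non-heap side.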

What type of query do you run to get all data? How does it fail? Have you looked at disk utilization and iowait while you are running the query?
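
If you don't have monitoring for iowait, a small Python sketch like the following could sample it on a data node while the query runs. It assumes the nodes run Linux and expose `/proc/stat`; the function names are made up for illustration, and tools such as `iostat` give the same information in more detail.

```python
import time


def parse_cpu_line(stat_text):
    """Return the aggregate CPU counters from /proc/stat as a list of ints."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):  # aggregate line, not cpu0, cpu1, ...
            return [int(v) for v in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(after, before)]
    deltas = [-d for d in deltas]  # after minus before
    # Fields: user, nice, system, idle, iowait, irq, softirq, steal, ...
    # Guest time is already included in user/nice, so only sum the first 8.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0


def sample_iowait(interval_s=5.0, path="/proc/stat"):
    """Take two samples interval_s apart and return the iowait percentage."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return iowait_percent(before, after)
```

A persistently high value from `sample_iowait()` during the query would point towards storage; a low value alongside 100% CPU would point towards the query being CPU bound.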

Also, on most of our nodes the JVM heap usage looks like a "saw-tooth". AFAIK this indicates over-allocation of RAM to the JVM, right?

I feel this cannot be an IO issue. We just rebuilt the cluster, and the nodes are sometimes able to read about 1 GB per second. What I mean is that the nodes were not using the full potential of the EBS volumes when I ran a query to get the data from the last 30 days.

That saw-tooth pattern is what you want to see and is very healthy. You appear to have a reasonable heap size.

I am not sure about the Kibana error though.

Can you capture the query being sent and try running this from the dev console in Kibana? Enabling the slow log and looking at what this shows might also be a good idea.
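
As a rough sketch of the slow log suggestion, the Python below builds the index settings request that turns on the search slow log. The address, the index pattern `my-index-*`, and the threshold values are placeholders chosen for illustration, not recommendations; the setting names are the standard `index.search.slowlog.threshold.*` ones.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: hypothetical address, no auth
INDEX_PATTERN = "my-index-*"      # assumption: placeholder index pattern

# Placeholder thresholds: queries slower than these get logged at that level.
SLOWLOG_SETTINGS = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}


def build_slowlog_request(base_url=ES_URL, index=INDEX_PATTERN,
                          settings=SLOWLOG_SETTINGS):
    """Build (but do not send) a PUT <index>/_settings request."""
    body = json.dumps(settings).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def apply_slowlog(request):
    """Send the request to the cluster and return the parsed JSON response."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

The same JSON body can be pasted into the Kibana dev console as `PUT my-index-*/_settings`. Slow queries then show up in each data node's search slow log file, including the query source.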

