Elasticsearch Super Slow and using low CPU

After being offline for a day, the Elasticsearch cluster became very slow. I checked for possible misconfigurations and could not find anything to make it faster. The cluster is green and there are no unassigned shards; however, it is using less than 20% of the CPU and it is super slow.

Here is an image of the node status:

And here is the cluster status:

{
  "cluster_name": "xxx",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 2,
  "number_of_data_nodes": 2,
  "active_primary_shards": 994,
  "active_shards": 1988,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

What are the specs of your nodes?

You have a lot of shards for just two nodes. Also, look at the load on your master node: it is at 22, which is very high.
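To put a number on "a lot of shards", here is a minimal Python sketch that works out shards per data node from the cluster health output posted above. Only the relevant fields are copied in, and the variable names are illustrative.

```python
import json

# Relevant fields copied from the cluster health output in the question.
health = json.loads("""
{
  "number_of_nodes": 2,
  "number_of_data_nodes": 2,
  "active_primary_shards": 994,
  "active_shards": 1988
}
""")

# Total shard copies (primaries plus replicas) spread over the data nodes.
shards_per_data_node = health["active_shards"] / health["number_of_data_nodes"]

# Each primary has (active_shards / active_primary_shards - 1) replica copies.
replicas_per_primary = health["active_shards"] / health["active_primary_shards"] - 1

print(f"shards per data node: {shards_per_data_node:.0f}")
print(f"replicas per primary: {replicas_per_primary:.0f}")
```

With the posted numbers, each of the two data nodes holds 994 shard copies.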

Each node has 64 GB of RAM, 16 vCPUs, and a 1 TB disk.

I don't see it in your post, but what version of Elasticsearch are you running? 8.x received a good number of improvements for clusters with many shards, so if you're running something like 7.x you might want to consider upgrading.

Also, general question, what makes you think the cluster is slow? Is there a specific use case you've seen issues with? Can you provide examples of the issue that "shows" slowness?

I'm using version 8.3.3, but I'm planning to update to the latest version this weekend. I say it is slow because everything I do in Kibana takes longer than normal: looking at alerts, searching for information, or loading any dashboard. I think it is Elasticsearch because it is using so little CPU, and that was not normal before.

What is the heap for each node?

The way to calculate how many shards you should have in a node based on the heap size changed in 8.3, but based on your disk size you seem to have too many small shards.

Assuming that you have something close to 420 GB of data, 994 primary shards would give an average of about 422 MB per shard. You should aim for a shard size between 10 GB and 50 GB.
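As a rough check of that arithmetic, here is a minimal Python sketch. The 420 GB total is the assumption stated above, not a measured value, and the function names are illustrative only.

```python
import math

def average_shard_size_mb(total_data_gb: float, primary_shards: int) -> float:
    """Average primary shard size in MB (decimal units, 1 GB = 1000 MB)."""
    return total_data_gb * 1000 / primary_shards

def shards_needed(total_data_gb: float, target_shard_gb: float) -> int:
    """Primary shards needed if each one holds about target_shard_gb."""
    return math.ceil(total_data_gb / target_shard_gb)

TOTAL_DATA_GB = 420     # assumption from this reply, not a measured value
PRIMARY_SHARDS = 994    # active_primary_shards from the cluster health output

avg_mb = average_shard_size_mb(TOTAL_DATA_GB, PRIMARY_SHARDS)
print(f"average shard size: {avg_mb:.1f} MB")
print("shards needed at 10 GB each:", shards_needed(TOTAL_DATA_GB, 10))
print("shards needed at 50 GB each:", shards_needed(TOTAL_DATA_GB, 50))
```

Under that assumption, the same data would fit in roughly 9 to 42 primary shards instead of 994.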

What is your use case? Do you have time based data?

Also, your CPU usage may be low, but your load is high for your specs. You showed a load of 22 on a 16-CPU node, which could mean that the disk on this node is struggling with the amount of I/O.
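A small Python sketch of why a load of 22 stands out on 16 CPUs. On Linux the load average counts tasks waiting on disk I/O as well as tasks using the CPU, so a load above the core count combined with low CPU usage often points to I/O wait. The function name and the 1.0 threshold are illustrative assumptions, not an Elasticsearch rule.

```python
def load_per_core(load_average: float, cpu_count: int) -> float:
    """Load average normalised by the number of CPUs."""
    return load_average / cpu_count

LOAD_AVERAGE = 22   # load shown for the master node
CPU_COUNT = 16      # vCPUs per node, as stated in the thread

ratio = load_per_core(LOAD_AVERAGE, CPU_COUNT)
print(f"load per core: {ratio:.3f}")

# Illustrative threshold: above 1.0 means more tasks are runnable or
# waiting on I/O than there are CPUs available to serve them.
if ratio > 1.0:
    print("load exceeds CPU count; check disk I/O wait on this node")
```

To confirm whether the disk is the bottleneck, a tool such as `iostat` on the node itself would show I/O wait and device utilisation.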
