Best Way to Ensure Elasticsearch Optimal Performance

Hello,

I am writing to seek guidance on how to ensure optimal performance of an Elasticsearch cluster.

I am running a three-node ES cluster with substantial resources (16 vCPUs, 128 GB RAM, 10 TB disk) on each node.
I am collecting logs from 300+ servers, and the indices are created (ILM is automatic) in the format index-name-{now/d}-000001. This basically creates an index every day, and an index rolls over roughly every 30 days or when it reaches 50 GB.
However, the indices have grown far too much (more than 3000 shards currently), and I feel this is affecting the cluster: some queries constantly time out, and even saving settings, such as creating new ingest pipelines, always fails with a 504 Gateway Timeout.
The Elasticsearch logs show timeouts when connecting to the other cluster nodes.
Any ideas on how to optimize my cluster?

I hope I communicated the issue well. Pardon me if I didn't.

What is the full output of the cluster stats API?

What is the retention period for your data?

What type of storage are you using? Local SSDs?

Hi @Christian_Dahlqvist
Please see the info below:
What is the full output of the cluster stats API?
Sorry, I am not able to get this output right now, but here is the output from when I ran the command earlier today.

{
  "_nodes" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "cluster_name" : "elk",
  "cluster_uuid" : "vsdHTS8LQ2GRlsX7XQr_9Q",
  "timestamp" : 1711543091000,
  "status" : "yellow",
  "indices" : {
    "count" : 1604,
    "shards" : {
      "total" : 2603,
      "primaries" : 2543,
      "replication" : 0.023594180102241447,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 6,
          "avg" : 1.6228179551122
        },
        "primaries" : {
          "min" : 1,
          "max" : 3,
          "avg" : 1.5854114713216958
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.03449709060681629
        }
      }
    },

What is the retention period for your data?
1 year
What type of storage are you using? Local SSDs?
datacentre HDD

The stats output is only partial, so not much to comment on there. Will need to see the rest to draw any conclusions.
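
For reference, the full document comes from GET _cluster/stats, and even the partial output allows a quick sanity check of the shard figures. Below is a minimal Python sketch of both steps. It assumes the cluster is reachable at http://localhost:9200 without authentication (an assumption, adjust for your setup), and the helper names are made up for illustration.

```python
import json
import urllib.request


def fetch_cluster_stats(base_url="http://localhost:9200"):
    """Fetch the full cluster stats document (GET _cluster/stats).

    base_url is an assumption; add authentication/TLS as your cluster requires.
    """
    with urllib.request.urlopen(f"{base_url}/_cluster/stats", timeout=30) as resp:
        return json.load(resp)


def summarize_shards(stats):
    """Derive a few shard figures from a cluster stats document."""
    nodes = stats["_nodes"]["total"]
    shards = stats["indices"]["shards"]
    total = shards["total"]
    primaries = shards["primaries"]
    replicas = total - primaries
    return {
        "indices": stats["indices"]["count"],
        "primaries": primaries,
        "replicas": replicas,
        "replication": replicas / primaries,
        # Plain average over all nodes; the real allocation can be uneven.
        "avg_shards_per_node": total / nodes,
    }


# Figures copied from the partial output posted above.
posted = {
    "_nodes": {"total": 3},
    "indices": {"count": 1604, "shards": {"total": 2603, "primaries": 2543}},
}
summary = summarize_shards(posted)
print(summary)
```

With the posted numbers this gives 60 replica shards against 2543 primaries, which matches the reported replication value of about 0.0236, and an average of roughly 868 shards per node.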

Elasticsearch is often limited by disk I/O. I would recommend running iostat -x on the nodes to see what await looks like. I would not be surprised if this is your main issue. Note that the use of SSDs is recommended in the search performance tuning guide as well as in the guide for optimizing indexing throughput.
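
If you want to collect the await figures from all three nodes in a comparable form, a small parser can help. The following is a rough Python sketch, not a polished tool. It assumes the sysstat package is installed, and because column names differ between sysstat versions (for example await versus r_await/w_await), it matches columns by suffix. The function names are hypothetical.

```python
import subprocess


def parse_iostat_await(text):
    """Return {device: {column: value}} for every '*await' column in `iostat -x` output.

    Column names differ between sysstat versions, so columns are matched
    by the 'await' suffix rather than by exact name.
    """
    result = {}
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            header = None  # a blank line ends the current device table
            continue
        if fields[0].startswith("Device"):
            header = fields  # remember the column names of the device table
            continue
        if header is None or len(fields) != len(header):
            continue  # banner, avg-cpu section, or an unexpected line
        values = {}
        for name, raw in zip(header[1:], fields[1:]):
            if name.endswith("await"):
                # Some locales print decimals with a comma.
                values[name] = float(raw.replace(",", "."))
        if values:
            result[fields[0]] = values
    return result


def current_await():
    """Run `iostat -x` once and parse its output."""
    out = subprocess.run(["iostat", "-x"], capture_output=True, text=True, check=True)
    return parse_iostat_await(out.stdout)
```

Calling current_await() on a node returns the await columns per device in milliseconds. Keep in mind that a single iostat -x run reports averages since boot; running it with an interval (for example iostat -x 5) shows what the disks are doing right now.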

It may also be worthwhile looking into how you shard and index your data. If you are actively writing to a significant number of indices and shards, this may result in a lot of small writes and a correspondingly high number of IOPS, which is not ideal for slower disks.
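
To get a feel for how the index cadence drives the shard count over a one-year retention period, here is a back-of-the-envelope Python sketch. The function and the input of 10 index patterns are hypothetical and not taken from this cluster. It also ignores size-based rollover, which would add further indices.

```python
import math


def estimate_primary_shards(series, primaries_per_index, retention_days, days_per_index):
    """Rough count of primary shards retained on the cluster.

    series: number of separate index name patterns written in parallel
    days_per_index: 1 for daily indices, 30 for a 30-day rollover, etc.
    Size-based rollover is ignored, so treat the result as a lower bound.
    """
    indices_per_series = math.ceil(retention_days / days_per_index)
    return series * indices_per_series * primaries_per_index


# Hypothetical input: 10 index patterns, 1 primary each, 1 year retention.
daily = estimate_primary_shards(10, 1, 365, 1)
monthly = estimate_primary_shards(10, 1, 365, 30)
print(daily, monthly)
```

The absolute numbers matter less than the ratio: under these assumptions, daily indices leave 3650 primary shards on the cluster, while a 30-day rollover leaves 130 for the same data.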