Elasticsearch and Fluentd optimisation for log cluster

We are using Elasticsearch and Fluentd for our central logging platform. Below are our configuration details.

Elasticsearch cluster:

Master nodes: 64 GB RAM, 8 CPUs, 9 instances
Data nodes: 64 GB RAM, 8 CPUs, 40 instances
Coordinator nodes: 64 GB RAM, 8 CPUs, 20 instances

Fluentd: at any given time we have around 1,000+ Fluentd instances writing logs to the Elasticsearch coordinator nodes. On a daily basis we create around 700-800 indices, which total about 4K shards per day, and we keep a maximum of 40K shards in the cluster. We have started facing performance issues on the Fluentd side, where Fluentd instances fail to write logs. Common errors are:

 1. read time out
 2. request time out
 3. {"time":"2021-07-02","level":"warn","message":"failed to flush the buffer. retry_time=9 next_retry_seconds=2021-07-02 07:23:08 265795215088800420057/274877906944000000000 +0000 chunk=\"5c61e5fa4909c276a58b2efd158b832d\" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error=\"could not push logs to Elasticsearch cluster ({:host=>\\\"logs-es-data.internal.tech\\\", :port=>9200, :scheme=>\\\"http\\\"}): [429] {\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"}],\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"},\\\"status\\\":429}\"","worker_id":0}
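The `429` circuit-breaker rejections suggest the coordinators are being overwhelmed by too many concurrent bulk requests (note `in_flight_requests` at 16.3 GB in the error). One common mitigation on the Fluentd side is to tune the buffer so each instance sends smaller, less frequent bulks and backs off on retry. A sketch of the relevant `fluent-plugin-elasticsearch` output section (the host, chunk sizes, and intervals below are illustrative assumptions, not recommendations for your exact workload):

```
<match logs.**>
  @type elasticsearch
  host logs-es-data.internal.tech
  port 9200
  # give slow bulk responses more time before declaring a read timeout
  request_timeout 60s
  <buffer>
    # smaller chunks -> smaller bulk requests held in flight on the coordinators
    chunk_limit_size 8MB
    flush_interval 10s
    flush_thread_count 2
    # back off exponentially on 429s instead of hammering the cluster
    retry_type exponential_backoff
    retry_max_interval 60
    # apply backpressure to the input rather than dropping chunks
    overflow_action block
  </buffer>
</match>
```

This only relieves the symptom; the reply below about reducing shard count addresses the underlying inefficiency.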

Looking for guidance on this: how can we optimise our logs cluster?

Which version of Elasticsearch are you using? What is the full output of the cluster stats API?
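For reference, the cluster stats can be pulled like this (Kibana Dev Tools syntax; against plain HTTP this would be a `curl` to the same path on one of your nodes):

```
GET _cluster/stats?human&pretty
```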

This sounds excessive and inefficient. Does this mean you only keep 10 days worth of data in the cluster? Why are you creating so many indices per day? How much data does the cluster hold?

We are on 7.12.
We have around 100-200 different projects and environments, which is why we have 700-800 indices.
And yes, we keep only the last 10 days' worth of data in the cluster.

Documents: 14,941,337,790
Disk usage: 3.4 TB
Primary shards: 17,460
Replica shards: 14,900

Having over 32,000 shards for just 3.4 TB of data is very, very inefficient, and most of your shards are going to be tiny. With 10-day retention, that works out to an average of well under 1 GB per shard, far below the commonly recommended tens of GB per shard. I would recommend no longer creating separate indices per project and reducing the shard count significantly. Please read this blog post for some practical guidelines on sharding.
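One way to consolidate is to write all projects into a small number of shared daily indices, keeping the project/environment as a document field, and enforce a low shard count with an index template. A minimal sketch (the template name, pattern, and shard counts are assumptions to adapt to your naming scheme):

```
PUT _index_template/logs-consolidated
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 1
    }
  }
}
```

Per-project views can then be provided with filtered aliases or simple queries on the project field, instead of per-project indices.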

If you consolidated your indices and reduced your shard count in line with the recommendations in the blog post I linked to, you should be able to reduce the size of your cluster significantly, maybe down to 3-5 data nodes.
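For existing over-sharded indices that you still need to keep for the retention window, the shrink API can reduce their shard count in place. A sketch, assuming a hypothetical index name; note that shrink also requires a copy of every shard to be relocated to a single node first, which is usually done with an allocation filter setting alongside the write block:

```
PUT logs-project-a-2021.07.02/_settings
{
  "settings": {
    "index.blocks.write": true
  }
}

POST logs-project-a-2021.07.02/_shrink/logs-project-a-2021.07.02-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}
```

Given your 10-day retention, it may be simpler to just fix the template going forward and let the old indices age out.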