We are using Elasticsearch and Fluentd for our central logging platform. Below are our configuration details:

Elasticsearch cluster:
Master nodes: 64 GB RAM, 8 CPUs, 9 instances
Data nodes: 64 GB RAM, 8 CPUs, 40 instances
Coordinator nodes: 64 GB RAM, 8 CPUs, 20 instances
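For context on the sizing: the breaker limit in the error below (31621696716 bytes, i.e. 29.4 GB) is exactly 95% of a 31 GB heap, which is the default parent circuit-breaker limit on a node running the usual ~31 GB heap out of 64 GB RAM. A minimal way to confirm the per-node heap, assuming the coordinator endpoint from the error log is reachable:

  curl -s 'http://logs-es-data.internal.tech:9200/_cat/nodes?v&h=name,node.role,heap.max,heap.percent,ram.max'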
Fluentd: at any given time we have around 1,000+ Fluentd instances writing logs to the Elasticsearch coordinator nodes. Each day we create around 700-800 indices, which amounts to roughly 4K new shards per day, and we keep at most 40K shards on the cluster. We have started facing performance issues on the Fluentd side, where Fluentd instances fail to write logs. The common errors are (a sketch of the output settings involved follows the examples):
1. Read timeouts
2. Request timeouts
3. {"time":"2021-07-02","level":"warn","message":"failed to flush the buffer. retry_time=9 next_retry_seconds=2021-07-02 07:23:08 265795215088800420057/274877906944000000000 +0000 chunk=\"5c61e5fa4909c276a58b2efd158b832d\" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error=\"could not push logs to Elasticsearch cluster ({:host=>\\\"logs-es-data.internal.tech\\\", :port=>9200, :scheme=>\\\"http\\\"}): [429] {\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"}],\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"},\\\"status\\\":429}\"","worker_id":0}
We are looking for guidance on this: how can we optimise our logging cluster to avoid these failures?
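For anyone responding, a quick way to snapshot the current shard totals and per-node breaker pressure (same placeholder endpoint as above):

  # Cluster-wide shard totals
  curl -s 'http://logs-es-data.internal.tech:9200/_cluster/health?filter_path=status,active_shards,unassigned_shards&pretty'
  # Parent circuit-breaker usage per node
  curl -s 'http://logs-es-data.internal.tech:9200/_nodes/stats/breaker?pretty'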