We are using Elasticsearch and Fluentd for our central logging platform. Below are our configuration details:

Elasticsearch cluster:
Master nodes: 64 GB RAM, 8 CPUs, 9 instances
Data nodes: 64 GB RAM, 8 CPUs, 40 instances
Coordinator nodes: 64 GB RAM, 8 CPUs, 20 instances
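For context on the sizing: the breaker limit in the error below (31621696716 bytes, i.e. 29.4 GB) is exactly 95% of a 31 GB heap, which is the default parent circuit-breaker limit on a node running the usual ~31 GB heap out of 64 GB RAM. A minimal way to confirm the per-node heap, assuming the coordinator endpoint from the error log is reachable:

  curl -s 'http://logs-es-data.internal.tech:9200/_cat/nodes?v&h=name,node.role,heap.max,heap.percent,ram.max'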
Fluentd: at any given time we have around 1,000+ Fluentd instances writing logs to the Elasticsearch coordinator nodes. Each day we create around 700-800 indices, which amounts to roughly 4K new shards per day, and we keep at most 40K shards on the cluster. We have started facing performance issues on the Fluentd side, where Fluentd instances fail to write logs. The common errors are (a sketch of the output settings involved follows the examples):
1. Read timeouts
2. Request timeouts
3. {"time":"2021-07-02","level":"warn","message":"failed to flush the buffer. retry_time=9 next_retry_seconds=2021-07-02 07:23:08 265795215088800420057/274877906944000000000 +0000 chunk=\"5c61e5fa4909c276a58b2efd158b832d\" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error=\"could not push logs to Elasticsearch cluster ({:host=>\\\"logs-es-data.internal.tech\\\", :port=>9200, :scheme=>\\\"http\\\"}): [429] {\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"}],\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"},\\\"status\\\":429}\"","worker_id":0}
We are looking for guidance on this: how can we optimise our logging cluster to avoid these failures?
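For anyone responding, a quick way to snapshot the current shard totals and per-node breaker pressure (same placeholder endpoint as above):

  # Cluster-wide shard totals
  curl -s 'http://logs-es-data.internal.tech:9200/_cluster/health?filter_path=status,active_shards,unassigned_shards&pretty'
  # Parent circuit-breaker usage per node
  curl -s 'http://logs-es-data.internal.tech:9200/_nodes/stats/breaker?pretty'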