How to slow down a large amount of data coming from Filebeat?

Hi Team,

How would you slow down a large amount of data streaming from Filebeat to Logstash so that it can be processed accurately in the filter section?

Thanks and Regards,
Sagar Mandal

Hi @Sagar_Mandal,

That should be mostly automatic. Of course, it depends on how much data would have to be cached...

I can't find this in any official Elastic documentation, but as far as I remember it is part of the Lumberjack protocol that is used between Filebeat and Logstash.

From Send Your Data | Logz.io Docs

One of the facts that make Filebeat so efficient is the way it handles backpressure: if Logstash is busy, Filebeat slows down its read rate and picks up the beat once the slowdown is over.
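
The practical question is then how much Filebeat will buffer while Logstash pushes back, and whether Logstash itself buffers to disk. A minimal sketch of the relevant knobs, with placeholder hosts and example values (not recommendations):

  # filebeat.yml
  queue.mem:
    events: 4096               # events Filebeat holds in memory before backpressure reaches its inputs
    flush.min_events: 2048
    flush.timeout: 1s

  output.logstash:
    hosts: ["logstash:5044"]
    bulk_max_size: 2048        # events per batch sent over the Lumberjack protocol

  # logstash.yml -- optional disk buffer so slow filters don't immediately push back to Filebeat
  queue.type: persisted
  queue.max_bytes: 4gb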

But from personal experience, if the Logstash filter section is very process-heavy, you can still get into trouble. I have killed my Logstash instances with sub-optimal filters, especially grok filters with poorly written patterns and no anchoring.
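
For illustration, anchoring just means pinning the pattern to the start (and, where possible, the end) of the line so that non-matching events fail fast instead of being retried at every offset. A sketch with a made-up pattern and field names for a hypothetical "IP METHOD PATH" log line:

  filter {
    grok {
      # ^ and $ anchor the pattern, so a line that doesn't match fails immediately
      # instead of the regex engine re-trying the match at every position.
      match => { "message" => "^%{IPORHOST:client} %{WORD:method} %{URIPATHPARAM:request}$" }
    }
  }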

Okay, so the thing is that about 100GB of data comes in from Filebeat to Logstash every single day, and it then goes through a filter section where a lot of conditional processing is done.

100GB per day is a lot of data. I would recommend something like:

filebeat --> kafka --> multiple logstash instances --> elasticsearch

Bursts of messages can then be queued in Kafka, and multiple Logstash instances can be used to scale the post-processing of that data.
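
A minimal sketch of that chain, with placeholder host names and a hypothetical topic called filebeat-logs:

  # filebeat.yml -- ship to Kafka instead of Logstash
  output.kafka:
    hosts: ["kafka1:9092", "kafka2:9092"]
    topic: "filebeat-logs"
    compression: gzip

  # pipeline .conf on each Logstash consumer instance
  input {
    kafka {
      bootstrap_servers => "kafka1:9092,kafka2:9092"
      topics => ["filebeat-logs"]
      group_id => "logstash"      # same group on every instance so they share the partitions
      codec => "json"             # Filebeat writes JSON-encoded events to Kafka
    }
  }
  output {
    elasticsearch {
      hosts => ["http://elasticsearch:9200"]
      index => "filebeat-%{+YYYY.MM.dd}"
    }
  }

Because all the Logstash instances join the same consumer group, Kafka spreads the topic's partitions across them, which is what lets you scale the filter work horizontally.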

Rob

We are doing about 300GB of logs (about 400M documents) with 4 Logstash instances that have 12 CPU cores each. To be fair, the load is < 1 at the moment. I did spend a lot of time optimizing our Logstash filters.

We are working on adding Kafka to the mix, not so much to deal with spikes but to be able to queue all messages during maintenance or if for some reason Logstash or Elasticsearch breaks completely.

@A_B, as you make the move to Kafka, here are a few things that will really boost throughput...

  1. increase pipeline.batch.size from the default of 125 to at least 1024 (1280 was best in my environment)
  2. increase pipeline.batch.delay from the default of 50 ms to at least 500 ms (1000 ms was best in my environment)
  3. in the kafka input, set max_poll_records to the same value as pipeline.batch.size
  4. each thread defined by consumer_threads in the kafka input will be an instance of a consumer, so if you have 4 Logstash instances with 2 threads each, that is 8 consumer instances. Your Kafka topics must have at least 8 partitions for all consumer threads to ingest data. You will want more partitions than your current needs so you can easily scale in the future.
  5. the number of pipeline.workers should be at least equal to consumer_threads.
  6. the kafka output should set batch_size to at least 16384

You may end up tweaking some of the buffer settings as well, but the above will give you a good starting point.
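
Pulling the list together, a sketch of what that could look like; the hosts and topic name are placeholders and the numbers are just the examples from the points above, not universal recommendations:

  # logstash.yml (each instance)
  pipeline.workers: 2          # at least equal to consumer_threads below (point 5)
  pipeline.batch.size: 1024
  pipeline.batch.delay: 500

  # kafka input in the pipeline .conf
  input {
    kafka {
      bootstrap_servers => "kafka1:9092,kafka2:9092"
      topics => ["filebeat-logs"]
      consumer_threads => 2        # 4 instances x 2 threads = 8 consumers, so the topic needs >= 8 partitions
      max_poll_records => "1024"   # keep in step with pipeline.batch.size (point 3)
    }
  }

  # kafka output, only where a Logstash tier writes into Kafka (point 6)
  output {
    kafka {
      bootstrap_servers => "kafka1:9092,kafka2:9092"
      topic_id => "filebeat-logs"
      batch_size => 16384
    }
  }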

Rob
