Losing messages at high traffic rates

I am doing log analysis using Filebeat (1.2) -> Logstash (2.3) -> Elasticsearch (2.3).

I have 4 Filebeat instances, 4 Logstash instances (6 cores each), and a 2-node Elasticsearch cluster (8 cores, 64 GB RAM).

In filebeat.yml, the prospector and Logstash output settings are as follows:

filebeat:
  prospectors:
    -
      paths:
        - /var/log/filebeat/*/*.json
      encoding: utf-8
      input_type: log
      ignore_older: 10m
      scan_frequency: 1s
      exclude_lines: ["^$"]
  spool_size: 3072
  registry_file: .filebeat

output:
  logstash:
    enabled: true
    hosts: ["logstash1:5044","logstash2:5044"]
    worker: 8
    loadbalance: true
    index: elkstats_record

Logstash's Elasticsearch output is configured like this:

    elasticsearch {
       hosts => ["node1", "node2"]
       index => "records_%{+YYYY.MM.dd}" # generates one index per day
       template_name => "template"
       document_id => "%{[@metadata][computed_id]}"  # set the document id explicitly
       workers => 2
       flush_size => 3500
    }

The LS_HEAP_SIZE environment variable on the Logstash hosts is set to 2048m.
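
For reference, a minimal sketch of how that heap setting is usually applied when starting Logstash 2.x from its bin/logstash script; the config path is a placeholder:

    # LS_HEAP_SIZE is read by bin/logstash and becomes the JVM -Xmx limit,
    # i.e. a maximum: the JVM only grows the heap when it actually needs it
    export LS_HEAP_SIZE=2048m
    bin/logstash -f /etc/logstash/conf.d/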

Traffic is normally 3000-4000 msg/sec. When it exceeds 6000-7000 msg/sec, the Logstash side repeatedly logs these messages:

    CircuitBreaker::rescuing exceptions {:name=>"Beats input", :exception=>LogStash::Inputs::Beats::InsertingToQueueTakeTooLong, :level=>:warn}
    Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover. {:exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::HalfOpenBreaker, :level=>:warn}

The system monitoring tool shows that the Logstash instances use 50-80% of the CPU during high-traffic hours, but only 800 MB of RAM.

The Elasticsearch side shows no obvious resource shortage: CPU usage is 15%-26% and memory usage is 26%.

The Elasticsearch error logs show org.apache.lucene.store.AlreadyClosedException; I have posted more details in this link.

It looks like Logstash is not processing messages fast enough, and Filebeat takes too long to insert events into Logstash's queue, so we end up losing the messages that never make it from Filebeat into the Logstash queue.

Is there any way to tune Logstash or Filebeat (flush_size, workers) so that Logstash processes messages faster?
Also, how can I make Logstash use the full LS_HEAP_SIZE of 2 GB to buffer unprocessed messages, instead of only 800 MB?
Is there any other way to prevent Logstash from losing messages?

How many Logstash filter workers are you running on each LS instance?

Have you tried increasing the pipeline batch size for LS with the -b switch? The default of -b 125 is pretty low. Increase it gradually to see if it helps. Mine is set to -b 1500 or -b 3000, depending on the throughput.

Also try increasing the number of LS filter workers; it should be at least 6, since each of your LS instances has 6 CPU cores.
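
As a concrete illustration (the thread itself only names -b), this is roughly how both knobs are passed to Logstash 2.3 on the command line; the config path and the numbers are placeholders to tune against your own throughput:

    # -w sets the number of pipeline (filter/output) workers, -b the per-worker batch size
    bin/logstash -f /etc/logstash/conf.d/pipeline.conf -w 6 -b 1500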

How many LS filters do you have? Too many filters hurt LS processing capacity.


Thank you for the prompt suggestions. Setting the batch size is very effective; there are no lost messages after setting an appropriate batch size.

Can you also provide links to articles about how to optimise LS processing?

This may be helpful: https://www.elastic.co/guide/en/logstash/2.3/pipeline.html. If possible, set the Logstash output to /dev/null and test the filter workers and pipeline batch size first. Use the LS metrics plugin to see how many messages LS can handle.
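
A rough sketch of that kind of throughput test, assuming the metrics filter plugin is available (the port and field names are illustrative). Only the periodic metrics events are printed; everything else is dropped in the output stage, which approximates sending the output to /dev/null:

    input {
      beats {
        port => 5044
      }
    }

    filter {
      # count every event and periodically emit a separate event tagged "metrics"
      metrics {
        meter => "events"
        add_tag => "metrics"
      }
    }

    output {
      # print only the rate metrics; ordinary events are discarded here,
      # so the pipeline is measured without the Elasticsearch output cost
      if "metrics" in [tags] {
        stdout {
          codec => line { format => "1m rate: %{[events][rate_1m]}" }
        }
      }
    }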