Lots of "Beats input: The circuit breaker has detected a slowdown or stall in the pipeline"

Hi,

My setup is:

filebeat -> logstash(localhost) -> elasticsearch (remote host)

and I'm experiencing a lot of missing events from filebeat.
The filebeat logs report back:

2016-06-09T10:34:00+01:00 INFO backoff retry: 1m0s
2016-06-09T10:35:05+01:00 INFO Error publishing events (retrying): EOF
2016-06-09T10:35:05+01:00 INFO send fail
2016-06-09T10:35:05+01:00 INFO backoff retry: 1m0s
2016-06-09T10:36:15+01:00 INFO Error publishing events (retrying): EOF
2016-06-09T10:36:15+01:00 INFO send fail
2016-06-09T10:36:15+01:00 INFO backoff retry: 1m0s
2016-06-09T10:37:20+01:00 INFO Error publishing events (retrying): EOF
2016-06-09T10:37:20+01:00 INFO send fail
2016-06-09T10:37:20+01:00 INFO backoff retry: 1m0s

and logstash reports back:

{:timestamp=>"2016-06-09T10:35:05.926000+0100", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2016-06-09T10:36:15.387000+0100", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Beats input", :exception=>LogStash::Inputs::Beats::InsertingToQueueTakeTooLong, :level=>:warn}
{:timestamp=>"2016-06-09T10:36:15.388000+0100", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2016-06-09T10:37:20.406000+0100", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Beats input", :exception=>LogStash::Inputs::Beats::InsertingToQueueTakeTooLong, :level=>:warn}
{:timestamp=>"2016-06-09T10:37:20.407000+0100", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::HalfOpenBreaker, :level=>:warn}

My logstash instance occupies 257 MB of RAM on average, and filebeat is at 20.8 MB.
My elasticsearch cluster handles about 1.1M events per hour on 6 beefed-up nodes.


I've read that this is a scaling issue, but I'm not sure what to scale.
Any help is appreciated.

I'm running the latest of everything in the ELK stack.

The beats input plugin uses a circuit breaker that closes connections if the input plugin cannot push events to the pipeline. The default timeout of the circuit breaker is 5 seconds. In addition, beats might break the connection and resend if logstash is unresponsive for N seconds (default = 30 seconds, I think).

Related config options:

  • congestion_threshold in logstash
  • timeout in filebeat
  • (optional) bulk_max_size in filebeat. Reducing the bulk size has little effect on logstash, but ACKs might be returned earlier by logstash, reducing the chance of timeouts in filebeat.

I'd recommend setting congestion_threshold to a very large number (effectively years), in order to disable the circuit breaker, plus setting timeout in filebeat to some higher acceptable value, e.g. at least twice the maximum timeout of the logstash outputs times the per-event processing overhead, given the problem is not slow filters (e.g. 120 seconds). Monitor the filebeat logs (info level) or logstash for reconnects and update the beats timeout accordingly.
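
For illustration, here's a minimal sketch of both sides; the port, file layout, and values are placeholders (assuming the 2.x-era beats input plugin and a 1.x-era filebeat), not a drop-in config:

    # logstash: beats input with the circuit breaker effectively disabled
    input {
      beats {
        port => 5044                      # placeholder port
        congestion_threshold => 999999999 # seconds before the breaker trips
      }
    }

    # filebeat.yml: logstash output with a raised timeout
    output:
      logstash:
        hosts: ["localhost:5044"]
        timeout: 120        # seconds; keep well above logstash's worst-case round trip
        bulk_max_size: 1024 # optional: smaller batches return ACKs earlier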

The root cause is most likely due to output not being very responsive/slow or some logstash filter stalls/slowdowns (e.g. inefficient grok filter).

I will give the congestion_threshold a try with the setup you described. Thanks.

My grok filter (it's the only processing I have in my logstash config) is the following:

    } else if [source] == "/var/log/upstart/webservices.log" {
        grok {
            match => {
                "message" => ".*Average per file: %{NUMBER:webservice_per_file_ms1:int} ms"
            }
        }
        mutate {
            remove_tag => [
                "beats_input_codec_plain_applied"
            ]
            remove_field => [
                "offset",
                "count",
                "audit_type",
                "message",
                "input_type",
                "beat"
            ]
        }
    }

Can you see anything that could do with optimising in the above snippet?

I have no idea about logstash (grok) filter optimization. Did you try to measure throughput in logstash itself? Maybe someone on the logstash forum can help in case increasing congestion_threshold doesn't work.
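
If you want a rough number, one way to do it (just a sketch; the config file name and the generator message are made up to match your grok pattern) is to push synthetic events through the filter and count the output rate using the generator input and the dots codec:

    # throughput-test.conf
    input {
      # synthetic load: a million copies of a line the grok pattern matches
      generator {
        count   => 1000000
        message => "Average per file: 42 ms"
      }
    }
    filter {
      grok {
        match => { "message" => ".*Average per file: %{NUMBER:webservice_per_file_ms1:int} ms" }
      }
    }
    output {
      stdout { codec => dots } # prints one dot per event
    }

Running it with stdout piped through pv shows the event rate, so you can compare throughput with and without the filter block:

    bin/logstash -f throughput-test.conf | pv -abt > /dev/null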

I've run overnight with congestion_threshold => 99999999 in logstash and timeout: 320 in filebeat, and it was smooth.
I didn't get any circuit breaker warnings, and the events seem to be in elasticsearch.

So does that mean the elasticsearch cluster is all right, but there is something to be done about logstash? Or am I completely wrong?

That's good. Circuit-breaker issues in logstash are a known problem. There is some work on improving/rewriting the beats input plugin and removing the circuit breaker here: https://github.com/logstash-plugins/logstash-input-beats/issues/92
