Slow or stalled pipeline

Hi

We are using the ELK Stack (ES, LS, Kibana, Filebeat) for a big customer project.
Currently we're still using it with an hourly-scheduled file-input configuration (where we read from the logfiles over network shares).
However, we want to switch it to the newer Filebeat based configuration.

Our setup:

  • Machine 1: Elasticsearch Node 1 + Logstash Instance
  • Machine 2: Elasticsearch Node 2
  • Machine 3-n: Filebeats delivering data to Logstash

Versions + Configuration Files:

The setup is working fine to some extent, meaning:
Basic network connectivity is fine, the filebeats are delivering data to LS, and LS starts delivering data to ES.
However, we are facing the problem that a few minutes after startup, after pushing some data to ES, Logstash starts logging the lines below repeatedly, and no more data is pushed to ES:

{:timestamp=>"2016-07-25T10:23:36.119000+0200", :message=>"Beats input: The circuit breaker has detected a slowdown or stall in the pipeline, the input is closing the current connection and rejecting new connection until the pipeline recover.", :exception=>LogStash::Inputs::BeatsSupport::CircuitBreaker::HalfOpenBreaker, :level=>:warn}
{:timestamp=>"2016-07-25T10:24:41.185000+0200", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Beats input", :exception=>LogStash::Inputs::Beats::InsertingToQueueTakeTooLong, :level=>:warn}

We tried several things:

  • setting process priority of LS to low while ES has normal process priority
  • set LS to push data to ES node 2 instead of node 1
  • set only one Filebeat to deliver data (so reduce data amount)

Would you have an idea what is going wrong?

P.S. I couldn't paste the configuration here as the post would then exceed the allowed 5000 characters.

The Logstash beats input plugin has a circuit breaker that closes connections if the event pipeline cannot be processed within 5 seconds. If ES takes longer than 5 seconds to index, Logstash will drop events and Filebeat has to retry sending them.

I'd recommend disabling the circuit breaker by setting congestion_threshold to a very large value (a few days, months, or years).
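A minimal sketch of what that could look like in the Logstash input config (the port is just a placeholder; the value is in seconds, here roughly one year):

    input {
      beats {
        port => 5044
        # a very large congestion_threshold (in seconds) effectively
        # disables the circuit breaker
        congestion_threshold => 31536000
      }
    }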

Beats additionally employs a connection timeout (default 30s). If ES might take longer from time to time (e.g. due to garbage collection), increase the timeout in Beats (e.g. to 2 minutes in an extreme scenario); a hedged filebeat.yml sketch follows the list below. In some cases grok filters can take very long too. Have you tried to collect some baseline throughput numbers, like:

  • filebeat -> file/console(/dev/null)
  • filebeat -> logstash -> stdout (use the dots codec and the pv tool to get some throughput stats)
  • ES indexing throughput
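Regarding the Beats connection timeout mentioned above, a minimal filebeat.yml sketch (the host is a placeholder; adjust to your setup):

    output:
      logstash:
        hosts: ["logstash-host:5044"]
        # network timeout in seconds; default is 30, raised here to tolerate
        # occasional Logstash/ES stalls (e.g. long GC pauses)
        timeout: 120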

Besides disabling the circuit breaker, tuning ES and Logstash should get you the most benefit.

The beats input plugin is currently being rewritten in Java, which also removes the circuit breaker. I hope this will help with performance and spurious slowdowns in the future.

P.S: I didn't check the configs as it's a pain to open them. Use grep or sed to strip the comments from the configs (and/or use some other paste service); see the example below. I'm mostly interested in the Beats output config, and maybe the Logstash input config.
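For instance, something along these lines (the filename is a placeholder) drops all comment lines:

    grep -v '^[[:space:]]*#' logstash.conf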

Hi Steffen

Thank you for the reply.

I will try your suggestions with the circuit breaker and the timeout.

About the throughput numbers: I haven't collected any yet, but that's a good point to check as well.

I'll let you know about the progress.

P.S. I had already removed the comments from the config files :wink:

Update 15.09.2016:
What we've done so far:

  • Throughput analysis of Logstash with the metrics plugin (with Logstash just outputting parsed content to $null and metrics being logged to the console). With this we were able to optimize worker threads and batch size. (A minimal sketch of such a test pipeline follows this list.)
  • Configured Logstash to output to both of our Elasticsearch nodes (instead of just to the master node)
  • Optimized regex patterns in grok. In one case we were parsing the message through three consecutive regex patterns; we managed to reduce them to a single pattern, which increased speed a lot. (A hypothetical before/after illustration also follows this list.)
  • Tried some performance analysis of Elasticsearch with a trial version of Sematext SPM. As our whole Elastic environment runs on Windows, the possibilities for performance analysis with Sematext were quite restricted, as it is a Linux-only tool. We got some cluster insights on indexing etc. but no system-specific data (CPU, RAM, JVM usage). For a complete picture this would have been helpful.
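As mentioned above, a minimal sketch of the throughput test pipeline, roughly following the metrics filter documentation (the regular filters stay in place; the one-minute event rate is printed to the console while parsed events are discarded for the test):

    filter {
      # ... the regular parsing filters go here ...
      metrics {
        meter   => "events"
        add_tag => "metric"
      }
    }
    output {
      # print only the metric events; everything else is dropped for the test
      if "metric" in [tags] {
        stdout {
          codec => line { format => "1m rate: %{[events][rate_1m]}" }
        }
      }
    }

And a purely hypothetical illustration of the grok consolidation (the actual patterns in our setup differ): instead of running three separate grok filters,

    grok { match => { "message" => "%{TIMESTAMP_ISO8601:ts}" } }
    grok { match => { "message" => "%{LOGLEVEL:level}" } }
    grok { match => { "message" => "%{GREEDYDATA:msg}" } }

a single anchored pattern matches the message in one pass and avoids repeated scanning and backtracking:

    grok { match => { "message" => "^%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:msg}$" } }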

However, as an outcome of the Elasticsearch performance analysis:

  • Split the Elasticsearch indices by application type and environment (where type and environment are two custom fields in our setup). This allows us to close indices in a more fine-grained way, thus reducing RAM usage and the amount of queried data. (See the sketch after this list.)
  • indices.fielddata.cache.size: 30%
    Defaults to unbounded. The previous value was 75%, but the reason for this is unknown. Elastic recommends setting it to 30%.
  • indices.fielddata.cache.expire: 60m
    This setting ages out unused field data. This may resolve or mitigate the issue we had where ES consumed all available RAM after running for some time.
  • indices.breaker.fielddata.limit: 60%
    Restored to the default of 60% (the previous value was 85%).
  • index.number_of_shards: 2
    Increased to 2 for potentially better query performance.
  • index.merge.scheduler.max_thread_count: 1
    As our infrastructure is HDD-based rather than SSD-based.
  • index.refresh_interval: 30s
    The default is 1 second, i.e. data is available in search queries 1 second after it is pushed to the index. This generates unnecessary load as we do not require real-time data, so we increased it to 30 seconds.
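For illustration, a sketch of how the per-type/per-environment index split looks in the Logstash output (host names are placeholders; type and environment are the custom fields mentioned above):

    output {
      elasticsearch {
        hosts => ["es-node-1:9200", "es-node-2:9200"]
        # one index per application type, environment and day,
        # e.g. logs-app1-prod-2016.09.15
        index => "logs-%{type}-%{environment}-%{+YYYY.MM.dd}"
      }
    }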
