Logstash Mixing Throttled and Non-Throttled inputs

Hi,

Intro

We use Logstash with both throttled and unthrottled inputs.
There's already an issue on GitHub, but per suyograo's recommendation, I'm opening the question here as well.

Components configuration:

  • Elasticsearch 1.7.0: 2x 6-core Intel Xeon 2.0 GHz with HT (24 logical cores in total), 30 GB heap, 10x 1 TB partitions.
    • java version "1.7.0_79" OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-1~deb8u1) OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)
    • Our cluster consists of one data+master node and a search-load-balancer node (master/data=false)
  • Logstash 1.5.2, OpenVZ container, Debian 7, 8GB memory, 3GB heap, 64 filter threads.
    • java version "1.7.0_75" OpenJDK Runtime Environment (IcedTea 2.5.4) (7u75-2.5.4-1~deb7u1) OpenJDK 64-Bit Server VM (build 24.75-b04, mixed mode)
    • The lumberjack input plugin is at the latest version and includes the circuit-breaker mechanism.
    • The Logstash configuration is in the 2nd reply below, or in the GitHub issue, due to the post body size limit.

Current Behaviour

In times of low throughput (outside heavy input hours), the event flow seems balanced between CDN logs and all other log types.
During heavy input hours, we suffer from serious degradation in lumberjack inputs (Nginx / Rails / Postfix), whereas TCP input (CDN log) throughput stays pretty much the same.

These images show the discrepancy between TCP logs and other (lumberjack) logs for a given time period:
TCP records:

Lumberjack (Nginx) records:

Nginx event count, by production host

NB: from around 21:00, the TCP logs drop off completely for some time, and the lumberjack input is much lower from then on. This is detailed in the last screenshot above, where each colour is a production app server (we have four high-volume production app servers on lumberjack inputs, as well as our low-volume staging servers; each one ships both Nginx and Rails logs via logstash-forwarder). It's interesting (and possibly relevant) that prior to the hiccough all four servers were sending roughly equal volumes of log data, but afterwards the proportions seemed to change.

The Logstash exceptions are in the 3rd reply below, or in the GitHub issue, due to the post body size limit.

Here’s what we assume might be happening, based on the behaviour described above:

  • The TCP input, which has no circuit breaker, hammers the pipeline and blocks (or throttles) the lumberjack input from pushing events. When the lumberjack input can't push events to its filter event queue, it refuses new connections.
  • The lumberjack circuit breaker limits shippers one by one, starting with the heavy hitters. The logic is equivalent, or at least similar, to TCP congestion control: repeatedly halve the throughput of the top abusers until things stabilize, then slowly raise the limits again (see the snippet just below this list).
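
For reference, here is the shape of our lumberjack input with the circuit-breaker timeout written out explicitly. This is only a sketch: congestion_threshold is, to the best of our understanding, the plugin option that controls how long the input waits on a blocked pipeline before the breaker opens, and 5 seconds is what we believe the default to be (we have not changed it).

input {
  lumberjack {
    host => "10.10.10.10"
    port => 5230
    ssl_certificate => '/ssl/logstash-forwarder.crt'
    ssl_key => '/ssl/logstash-forwarder.key'
    # Assumed circuit-breaker knob and its assumed default: seconds to wait on a
    # blocked pipeline before the breaker opens and new connections are refused.
    congestion_threshold => 5
  }
}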

Out of all this come the following questions:

  1. Is there any basis for our assumptions above?
  2. Is our use case sane? Is such a setup (mixing circuit-breaker and non-circuit-breaker inputs) sane?
  3. Can any Logstash developers confirm or deny our hypotheses above?

Bottom line:
What do we have to do to enable our Logstash instance to cope with this? We’re happy to answer any requests for more detail on our setup, and try any suggestions that the community may have about how to move forward.

Thanks,
Yarden and DevOps Team

Below is the Logstash configuration:

input {
  lumberjack {
    host => "10.10.10.10"
    port => 5230
    ssl_certificate => '/ssl/logstash-forwarder.crt'
    ssl_key => '/ssl/logstash-forwarder.key'
  }
  tcp {
    port => 5514
    type => "CDN"
    ssl_cert => "/ssl/logstash-forwarder.crt"
    ssl_key => "/ssl/logstash-forwarder.key"
    ssl_enable => true
    codec => line {
      charset => "ISO-8859-1"
    }
  }
}
filter {
  # Grok filters for Postfix, CDN, Nginx and application logs.
  # The general structure is:
  # if [type] == "<SOME_TYPE>" {
  #   process the message with grok and other filter types
  # }
}
output {
  if [@metadata][type] in [ "postfix", "CDN" ] {
    elasticsearch {
      host => [ "10.10.0.20" ]
      protocol => "transport"
      port => "9300"
      cluster => "elasticsearch"
      index => "%{[@metadata][type]}-%{+YYYY.MM.dd}"
      document_type => "%{[@metadata][type]}"
      manage_template => false
    }
  }
  # 3 more "else if" branches with elasticsearch outputs handle the rest of the event types.
  else {
    file {
      path => "/var/log/logstash/unknown_messages.log"
    }
  }
}

Logstash exceptions:

{:timestamp=>"2015-07-20T18:48:43.612000+0000", :message=>"CircuitBreaker::Open", :name=>"Lumberjack input", :level=>:warn}
{:timestamp=>"2015-07-20T18:48:43.613000+0000", :message=>"Exception in lumberjack input thread", :exception=>#<LogStash::CircuitBreaker::OpenBreaker: for Lumberjack input>, :level=>:error}
{:timestamp=>"2015-07-20T18:48:43.666000+0000", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}
{:timestamp=>"2015-07-20T18:48:43.667000+0000", :message=>"CircuitBreaker::Open", :name=>"Lumberjack input", :level=>:warn}
{:timestamp=>"2015-07-20T18:48:43.668000+0000", :message=>"Exception in lumberjack input thread", :exception=>#<LogStash::CircuitBreaker::OpenBreaker: for Lumberjack input>, :level=>:error}
{:timestamp=>"2015-07-20T18:48:43.729000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-20T18:48:44.229000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-20T18:48:44.730000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-20T18:48:45.231000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}
{:timestamp=>"2015-07-20T18:48:45.251000+0000", :message=>"CircuitBreaker::rescuing exceptions", :name=>"Lumberjack input", :exception=>LogStash::SizedQueueTimeout::TimeoutError, :level=>:warn}

Hi Elastic team,
Do you have any ideas or thoughts about the questions raised in this ticket?

We hoped this discussion would attract more attention...

Thanks,
Yarden

Hi Yarden,
we have the same error message. Nothing changed on our systems; there was just a restart of all our systems.
Dear Elastic, we'd be happy to have a discussion about this.

I'm chipping in as the OP's colleague. We're relatively new to ELK, six months in and going from a POC to (nearly) production, and the scope of our project has widened as we realized what the ELK stack could do for us. :smile:

After much tweaking and hardware upgrades, we've squashed all bottlenecks and issues apart from this one; in other words, this is the last blocker before our ELK stack becomes production-ready. It's an issue that we hoped would attract interest, and we're frustrated by the lack of discussion.

Have we made mistakes here? Is our use case bizarre? Was our OP too long? Or did this just fall through the cracks?

Hey Bastian, we replied at the same time. :slight_smile:

Out of interest, do you have the same use case of throttled / non-throttled inputs? Can you describe more about your setup?

Hi Logstash team,
We thought this thread would gain some traction, but for some reason it hasn't. This is really blocking us.

Can anyone give a timeframe for when this issue will get proper attention from a developer, or some other input on how to proceed?

Thanks,
Yarden

@jordansissel, continuing our discussion from [here][1], can you please elaborate (maybe with an example) on how the behaviour described in the OP might be a configuration issue?

> I agree it's weird that tcp input continues to function normally while lumberjack does not, but I suspect this is something you can fix with the lumberjack configuration. My guess, for now, is that this is a configuration issue, and it may be best discussed on discuss.elastic.co - I'm not sure there's a bug just yet.

Thank you,
Yarden
[1]: https://github.com/elastic/logstash/issues/3642 "Logstash Mixing Throttled and Non-Throttled inputs · Issue #3642 · elastic/logstash · GitHub"

Have you considered a broker to buffer things?

As logstash-forwarder reads data from files, it can simply stop reading when it encounters back pressure from Logstash, generally without losing any data. The TCP input plugin, however, handles back pressure differently: it stops reading until the back pressure has cleared, eventually filling up buffers. Depending on how the system sending data to the plugin handles this, it could lead to data loss.

Isn't it therefore a desirable behaviour that the TCP input plugin gets priority, as reported by the OP, as this could help reduce the risk of data loss?

If this behaviour is not desired, there are a couple of workarounds. One is to split the configuration into two and run a separate Logstash instance for each input, so that they handle back pressure independently. The other is, as Mark suggests, to introduce a message queue that can buffer data when there is a spike in volume or an error further down the line, reducing the risk of back pressure affecting the input plugins.
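
As a rough sketch of the first option, reusing the inputs from your configuration above (file names are placeholders, and the filters/outputs are elided):

# /etc/logstash/lumberjack.conf: run by a dedicated Logstash instance
input {
  lumberjack {
    host => "10.10.10.10"
    port => 5230
    ssl_certificate => '/ssl/logstash-forwarder.crt'
    ssl_key => '/ssl/logstash-forwarder.key'
  }
}
# ... Nginx / Rails / Postfix filters and their elasticsearch outputs ...

# /etc/logstash/cdn.conf: run by a second, independent Logstash instance
input {
  tcp {
    port => 5514
    type => "CDN"
    ssl_cert => "/ssl/logstash-forwarder.crt"
    ssl_key => "/ssl/logstash-forwarder.key"
    ssl_enable => true
    codec => line { charset => "ISO-8859-1" }
  }
}
# ... CDN filter and its elasticsearch output ...

Since each instance then has its own internal queue, back pressure from the CDN stream can no longer starve the lumberjack connections.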

Hi @warkolm, @Christian_Dahlqvist,
Introducing a broker into the single Logstash deployment would create an ever-increasing lag between related events (coming from different sources), because the event count from our CDN is much higher than that of our internal logs. It is true that event ingestion eventually catches up, but the lag prevents our BI/monitoring/other systems from relying on this data as 'live'.

We decided a long time ago (see the '1 month later' mark :slight_smile: ) to deploy two Logstash instances, one to handle the TCP traffic from the CDN provider and the other to handle internal logs. Due to an unfortunate chain of events we haven't had the chance to test this deployment yet, but we will update as soon as we have results/insights in hand.

Why not have a broker anyway and then have multiple queues on it, one for the CDN and one for everything else?

Assuming that you're referring to the 'single' Logstash deployment (handling both CDN and application logs), can you elaborate on how letting Logstash read messages from a broker would enable more balanced log ingestion?
In our scenario, the CDN shippers generate a vast event count, which will pile up on the broker.

We're currently strengthening our ES cluster, including the Logstash machines, and will provide an update after testing a dual-headed Logstash deployment: one instance for the CDN logs and another for the application logs.

Right, but you can point multiple LS instances to the same broker+queue.
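
As a very rough sketch (Redis is just one example of a broker; the host name and key names below are placeholders), a thin shipper instance only routes events into two lists, and separate indexer instances consume each list at their own pace, so a backlog of CDN events does not delay the application logs:

# Shipper instance: keeps both inputs, does no filtering, routes events to the broker.
output {
  if [type] == "CDN" {
    redis { host => "broker.internal" data_type => "list" key => "cdn" }
  } else {
    redis { host => "broker.internal" data_type => "list" key => "apps" }
  }
}

# Indexer instance(s) for CDN logs: scale these independently of the rest.
input {
  redis { host => "broker.internal" data_type => "list" key => "cdn" }
}

# Indexer instance(s) for application logs.
input {
  redis { host => "broker.internal" data_type => "list" key => "apps" }
}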