SQS output errors cause blocked input pipelines

Over the last week it has become a daily occurrence: we see a drop in logs in Elasticsearch caused by blocked pipelines in our Logstash stack.

We run one cluster of Logstash instances that accepts logs and pushes them into an SQS queue, and a second cluster that reads from the queue, filters, and pushes to Elasticsearch. This particular error occurs on the first cluster and is only resolved by restarting the Logstash process (container).
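For context, the second (indexer) cluster's pipeline is roughly the shape below; queue name, region, and Elasticsearch host are placeholders, not our real values:

input {
  sqs {
    queue => "our-log-queue"          # placeholder: the same queue the shipper cluster writes to
    region => "eu-west-1"             # placeholder
  }
}
filter {
  # all filtering happens on this cluster
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]   # placeholder
  }
}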

{:timestamp=>"2016-06-14T13:04:34.272000+0000", :message=>"Failed to flush outgoing items", :outgoing_count=>5, :exception=>"AWS::Errors::Base", :backtrace=>["/opt/logstash/vendor/bundle/jruby/1.9/gems/aws-sdk-v1-1.66.0/lib/aws/core/client.rb:375:in `return_or_raise'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/aws-sdk-v1-1.66.0/lib/aws/core/client.rb:476:in `client_request'", "(eval):3:in `send_message_batch'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/aws-sdk-v1-1.66.0/lib/aws/sqs/queue.rb:551:in `batch_send'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-sqs-2.0.4/lib/logstash/outputs/sqs.rb:129:in `flush'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.22/lib/stud/buffer.rb:219:in `buffer_flush'", "org/jruby/RubyHash.java:1342:in `each'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.22/lib/stud/buffer.rb:216:in `buffer_flush'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.22/lib/stud/buffer.rb:193:in `buffer_flush'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.22/lib/stud/buffer.rb:159:in `buffer_receive'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-sqs-2.0.4/lib/logstash/outputs/sqs.rb:121:in `receive'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.1-java/lib/logstash/outputs/base.rb:83:in `multi_receive'", "org/jruby/RubyArray.java:1613:in `each'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.1-java/lib/logstash/outputs/base.rb:83:in `multi_receive'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.1-java/lib/logstash/output_delegator.rb:130:in `worker_multi_receive'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.1-java/lib/logstash/output_delegator.rb:129:in `worker_multi_receive'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.1-java/lib/logstash/output_delegator.rb:114:in `multi_receive'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.1-java/lib/logstash/pipeline.rb:301:in `output_batch'", "org/jruby/RubyHash.java:1342:in `each'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.1-java/lib/logstash/pipeline.rb:301:in `output_batch'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.1-java/lib/logstash/pipeline.rb:232:in `worker_loop'", "/opt/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-2.3.1-java/lib/logstash/pipeline.rb:201:in `start_workers'"], :level=>:warn}

{:timestamp=>"2016-06-14T13:04:53.142000+0000", :message=>"Lumberjack input: the pipeline is blocked, temporary refusing new connection.", :level=>:warn}

{:timestamp=>"2016-06-14T13:04:53.534000+0000", :message=>"Beats input: the pipeline is blocked, temporary refusing new connection.", :reconnect_backoff_sleep=>0.5, :level=>:warn}

The SQS output config:

output {
  sqs {
    batch_events => 5
    queue => "${SQS_OUTPUT_QUEUE}"
    region => "${AWS_REGION}"
  }
}

The Beats input config:

input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/pki/tls/certs/logstash-forwarder/lumberjack.crt"
    ssl_key => "/etc/pki/tls/private/logstash-forwarder/lumberjack.key"
  }
}

These nodes do not perform any filtering, just input -> queue.

The Dockerfile:

FROM logstash:2.3.1

ENV SERVICE_NAME=logstash
CMD ["--allow-env", "-f", "/opt/config"]

COPY ./config/shipper /opt/config
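(As far as I understand, the --allow-env flag in the CMD is what lets the ${SQS_OUTPUT_QUEUE} and ${AWS_REGION} references in the configs above be resolved from the container's environment at startup.)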

Unfortunately, the error is not very helpful. I am assuming it is BatchRequestTooLong, but that is just a guess. For now, I will disable batch sending.
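My reasoning: BatchRequestTooLong is the SQS error returned when the combined payload of a SendMessageBatch request exceeds the 256 KB request limit, and with batch_events => 5 a handful of large events could plausibly trip that. As a sketch of what I mean by disabling batching, assuming the 2.x plugin still exposes a boolean batch option (worth verifying against the plugin docs):

output {
  sqs {
    batch => false                    # assumed option name; sends events as individual SendMessage calls
    queue => "${SQS_OUTPUT_QUEUE}"
    region => "${AWS_REGION}"
  }
}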

There's nothing in the ES logs?

@magnusbaeck our setup resembles the last diagram in the Deploying and Scaling Logstash guide: https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html

The issue occurs in the first set of Logstash instances, the ones responsible for outputting logs to the queue (SQS), and because of that we do not receive any logs in ES.