Intermittent timeouts (Doesn't appear to stop service)

Hello, we are starting a project to move to ELK 7, with Elasticsearch and Kibana hosted by elastic.co rather than self-hosted.

In our ELK 6 install, I had an Nginx load balancer in front of our Logstash nodes; this time round I've opted to use an AWS Network Load Balancer.

Although logs are reaching Logstash absolutely fine, I do see this in the Filebeat logs, usually every minute:

|Timestamp|Level|Source|Message|
|---|---|---|---|
|2020-02-06T09:31:25.235Z|ERROR|logstash/async.go:256|Failed to publish events caused by: read tcp 10.37.1.229:35290->10.16.2.213:5044: i/o timeout|
|2020-02-06T09:31:25.238Z|ERROR|logstash/async.go:256|Failed to publish events caused by: client is not connected|
|2020-02-06T09:31:26.762Z|ERROR|pipeline/output.go:121|Failed to publish events: client is not connected|
|2020-02-06T09:31:56.937Z|ERROR|logstash/async.go:256|Failed to publish events caused by: read tcp 10.37.1.229:35478->10.16.2.213:5044: i/o timeout|
|2020-02-06T09:31:56.938Z|ERROR|logstash/async.go:256|Failed to publish events caused by: client is not connected|
|2020-02-06T09:31:58.935Z|ERROR|pipeline/output.go:121|Failed to publish events: client is not connected|

I don't have any pipeline configuration at present and am just using the traditional conf.d/01-input for my Beats config, shown below:

input {
  beats {
    port => 5044
    ssl => true
    ssl_certificate => "/etc/pki/logstash/globalsign_x.crt"
    ssl_certificate_authorities => "/etc/pki/logstash/globalsign_x.crt"
    ssl_key => "/etc/pki/logstash/globalsign_x.key"
    ssl_verify_mode => "peer"
  }
}

And the connection settings for filebeat are:

output.logstash:
  # The Logstash hosts
  hosts: ["logstash-balancer.x:5044"]


  # Optional SSL. By default is off.

  ssl.enabled: true
  # List of root certificates for HTTPS server verifications
  ssl.certificate_authorities: ["/etc/pki/filebeat/globalsign_x.crt"]

  # Certificate for SSL client authentication
  ssl.certificate: "/etc/pki/filebeat/globalsign_x.crt"

  # Client Certificate Key
  ssl.key: "/etc/pki/filebeat/globalsign_x.key"

  ttl: 90s

I added the ttl as an experiment, as I know the idle timeout with an NLB is 350 seconds (and can't be changed), but it doesn't appear to have made a difference.
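
A note on the ttl experiment, for anyone trying the same thing: as far as the Filebeat documentation describes it, `ttl` is not supported on the asynchronous Logstash client, that is, whenever `pipelining` is enabled, and the documented default for `pipelining` is 2. That could be why the setting appeared to have no effect. A minimal sketch, assuming the same `output.logstash` block as above and that giving up pipelining is acceptable for the test:

```yaml
output.logstash:
  hosts: ["logstash-balancer.x:5044"]

  # Assumption: ttl is only honoured by the synchronous client,
  # so pipelining is switched off for this experiment.
  pipelining: 0

  # Re-establish the connection every 90s, well under the NLB's 350s idle timeout.
  ttl: 90s
```

Disabling pipelining may reduce throughput, so this is better treated as a diagnostic step than as a permanent setting.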

Any idea how I can debug this or am I missing something painfully obvious?

I do sometimes see this on the Logstash side, but it doesn't match up with the timeouts on the Filebeat side:

[2020-02-06T09:49:12,655][INFO ][org.logstash.beats.BeatsHandler] [main] [local: 0.0.0.0:5044, remote: 10.37.1.135:50616] Handling exception: (NoMethodError) undefined method accept for nil:NilClass


[2020-02-06T09:21:43,377][WARN ][io.netty.channel.DefaultChannelPipeline][main] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
    org.jruby.exceptions.NoMethodError: (NoMethodError) undefined method `accept' for nil:NilClass
    	at usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.logstash_minus_input_minus_beats_minus_6_dot_0_dot_5_minus_java.lib.logstash.inputs.beats.message_listener.onNewMessage(/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-beats-6.0.5-java/lib/logstash/inputs/beats/message_listener.rb:44) ~[?:?]
    [2020-02-06T09:21:43,377][WARN ][io.netty.channel.DefaultChannelPipeline][main] An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
    org.jruby.exceptions.NoMethodError: (NoMethodError) undefined method `accept' for nil:NilClass
    	at usr.share.logstash.vendor.bundle.jruby.$2_dot_5_dot_0.gems.logstash_minus_input_minus_beats_minus_6_dot_0_dot_5_minus_java.lib.logstash.inputs.beats.message_listener.onNewMessage(/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-input-beats-6.0.5-java/lib/logstash/inputs/beats/message_listener.rb:44) ~[?:?]

This seems to be the same issue:

Using a load balancer with Elastic Cloud is a bit trickier than it looks from the outside: the endpoint that you are using is actually a proxy (rerouting traffic when upgrading or scaling up / down, possibility for IP filtering...). This proxy looks for the header X-Found-Cluster to redirect your requests to the right cluster; you can copy its value from the cloud admin UI. If you haven't set that header, your traffic will not reach the right cluster.

For example this is my sample configuration with nginx:

location / {
    proxy_pass       {{ kibana_host }};
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header X-Found-Cluster {{ kibana_id }};
}

This might not be possible with NLB, since I think that one only works at the transport layer (TCP / UDP / TLS), not the application layer (HTTP).

The question is, why do you even need that load balancer? General best practice for your setup, easier configuration, custom domain / certificate, ...?

Just to clarify, I am using a NLB to load balance Logstash instances, not Kibana.

I want SSL from Filebeat > Logstash and I want resilience of multiple availability zones in AWS to match the Elasticsearch deployment.

So I am not actually using a load balancer to interact with anything on the Elastic Cloud side of things. The only thing that interacts with Elastic Cloud is the Elasticsearch output running in Logstash.

Ah my bad, I thought this was from Logstash to Elastic Cloud.

I've never tried to set up Beats to Logstash through an NLB, but the question remains: Why? Beats will do load balancing to Logstash (or pick a random one if you configure it), but will always wait for an ACK and pick a different Logstash node if one isn't available any more.

Less configuration on the Logstash hosts themselves around SSL. A DNS entry is all that's required for SSL passthrough.

We always want our Logstash nodes to be private; anything outside of our AWS environment can hit the LB and nothing else (improved security).

It's worth noting the sheer number of threads on this board lately with the same issue. I have tried it without a balancer and had the same timeout errors, so for now the balancer appears to be innocent.

Adding a timeout to filebeat seemed to address it, but I don't fancy reconfiguring every host:

timeout: 90
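
For reference, a sketch of where that setting would sit, assuming the same `output.logstash` block shown earlier in the thread. Per the Filebeat documentation, `timeout` is the number of seconds Filebeat waits for a response from the Logstash server before timing out, with a documented default of 30.

```yaml
output.logstash:
  hosts: ["logstash-balancer.x:5044"]
  ssl.enabled: true

  # Seconds to wait for a response from Logstash before giving up (default: 30).
  timeout: 90
```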

So I am going to try to add this to logstash instead:

client_inactivity_timeout => 120

This is resolved by the client inactivity timeout. It's a bit of an odd error to receive when the connection is just "standing by"; however, both of the above config changes work. I went with the Logstash client inactivity timeout as my preferred fix.

After spending a few days on this, I have found this is due to a bug in the Beats input of Logstash: https://github.com/elastic/logstash/issues/11540

I resolved it by upgrading my ELK stack to 7.6.0, but you can just upgrade the plugin in Logstash if required by following the command in the above GitHub issue.