Filebeat EOF Error

Hi guys,
I am running an ELK stack on a cluster of machines: 3 ES, 2-4 LS and 12 FB.

I have a problem with Filebeat: sometimes it writes this error in the log

2016-09-19T14:31:21+02:00 ERR Failed to publish events caused by: EOF
2016-09-19T14:31:21+02:00 INFO Error publishing events (retrying): EOF

the frequency of the errors depends

  • on the number of logs I have to collect
  • on the configuration of LS (number of nodes, workers and batch size)
  • on the configuration of FB (batch size and number of LS hosts)

I am using FB v5 alpha, LS stable and LS v5 alpha, ES stable.
I tried changing the config but I still get the same errors.

LS does not write anything in the log; once in a while I get
ERROR org.logstash.beats.BeatsHandler - Exception: Connection reset by peer

But it does not happen at the same time as the error in FB.

Have you got some other device/firewall in between? Someone is closing the underlying TCP connection.

Hi Steffen, it sounds a little strange to me, as I get it only once in a while (at most once per minute, on average about once every 5 minutes).
I thought a firewall would be blocking everything, am I wrong? Can it be something different?

This happening only once in a while sounds much different from the original post. Can you also share your Filebeat output settings and Logstash input settings?

Some firewalls/proxies/NAT devices do close connections if there has been no data exchange for some time X. That's why I'm asking.

I checked twice with my domain admin: there is actually no firewall between the machines. Moreover, I am pretty sure there is a continuous data exchange (I can see it from Kibana too).

Just as an example: this morning I connected all the logger machines, and after the 6th I started getting some errors. Once LS was warmed up, only one machine out of 10 kept returning that error about every 15s.
I shut everything down and changed the config, but nothing changed.

Yesterday, with a smaller LS cluster, I got up to 6 working machines.

I improved my LS cluster; now it is using 2 machines.
The logstash.yml is the following:

path.data: /var/lib/logstash
pipeline.workers: 24
pipeline.output.workers: 1
pipeline.batch.size: 15000
path.config: /etc/logstash/conf.d
config.reload.automatic: true
config.reload.interval: 30
log.level: verbose
path.log: /var/log/logstash/logstash.log

The jvm.options file is the standard one; I only changed

-Xms12g
-Xmx12g

The FB config is the following:

filebeat.prospectors:
- input_type: log

  paths:
    - d:Logger\1
  exclude_lines: ["^\\s*$"]
  include_lines: ["^([a-z]|[A-Z])"]
  fields:
    catena: 11
  fields_under_root: true
  ignore_older: 10m
  close_inactive: 2m
  clean_inactive: 15m
  document_type: log1
  
- input_type: log

  paths:
    - d:Logger\2
  exclude_lines: ["^\\s*$"]
  include_lines: ["^([a-z]|[A-Z])"]
  fields:
    catena: 11
  fields_under_root: true
  ignore_older: 10m
  close_inactive: 2m
  clean_inactive: 15m
  document_type: log2
  
- input_type: log

  paths:
    - d:Logger\3
  exclude_lines: ["^\\s*$"]
  include_lines: ["^([a-z]|[A-Z])"]
  fields:
    catena: 11
  fields_under_root: true
  ignore_older: 10m
  close_inactive: 2m
  clean_inactive: 15m
  document_type: log3
  
- input_type: log

  paths:
    - d:Logger\4
  exclude_lines: ["^\\s*$"]
  include_lines: ["^([a-z]|[A-Z])"]
  fields:
    catena: 11
  fields_under_root: true
  ignore_older: 10m
  close_inactive: 2m
  clean_inactive: 15m
  document_type: log4
  
- input_type: log

  paths:
    - d:Logger\5
  exclude_lines: ["^\\s*$"]
  include_lines: ["^([a-z]|[A-Z])"]
  fields:
    catena: 11
  fields_under_root: true
  ignore_older: 10m
  close_inactive: 2m
  clean_inactive: 15m
  document_type: log5
  
- input_type: log

  paths:
    - d:Logger\6
  exclude_lines: ["^\\s*$"]
  include_lines: ["^([a-z]|[A-Z])"]
  fields:
    catena: 11
  fields_under_root: true
  ignore_older: 10m
  close_inactive: 2m
  clean_inactive: 15m
  document_type: log6
  
- input_type: log

  paths:
    - d:Logger\7
  exclude_lines: ["^\\s*$"]
  include_lines: ["^([a-z]|[A-Z])"]
  fields:
    catena: 11
  fields_under_root: true
  ignore_older: 10m
  close_inactive: 2m
  clean_inactive: 15m
  document_type: log7
  
#========================= Filebeat global options ============================

filebeat.spool_size: 10000
filebeat.idle_timeout: 10s

#----------------------------- Logstash output --------------------------------
output.logstash:
  hosts: ["10.246.85.242:5044", "10.246.85.243:5044"]
  template.name: "filebeat"
  loadbalance: true
  bulk_max_size: 1000  
  template.path: "filebeat.template.json"
  template.overwrite: false

Which Logstash version are you testing with?

The Logstash beats input plugin has a config option named congestion_threshold, set to 5 seconds by default. If the Logstash pipeline cannot process data for client X within this duration, the connection to client X is closed and the circuit breaker will not accept connections for some time. With multiple beats, I recommend effectively disabling the threshold by setting it to multiple days. The new Java rewrite should have the circuit breaker removed.
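For example, a beats input with the threshold raised far beyond its 5 second default could be sketched like this (the port matches the one in your FB output; the exact value is just illustrative):

input {
  beats {
    port => 5044
    # ~3 days in seconds; effectively disables the congestion check (default is 5)
    congestion_threshold => 259200
  }
}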

I'm working with LS 5 alpha and 2.4 (latest stable); both gave the same error, but the alpha allowed me to connect more machines.

I tried with 3600 and I am still getting the same error; plus now I find i/o timeouts on FB, and LS prints this error:

{:timestamp=>"2016-09-20T15:33:34.077000+0200", :message=>"PeriodicPoller: exception", :poller=>#<LogStash::Instrument::PeriodicPoller::JVM:0x15e1af9f @task=#<Concurrent::TimerTask:0x45e557e2 @observers=#<Concurrent::Collection::CopyOnNotifyObserverSet:0x5f4f22c8 @observers={#<LogStash::Instrument::PeriodicPoller::JVM:0x15e1af9f ...>=>:update}>, @timeout_interval=60.0, @running=#<Concurrent::AtomicBoolean:0x14357ba2>, @StoppedEvent=#<Concurrent::Event:0x7b806481 @set=false, @iteration=0>, @execution_interval=1.0, @do_nothing_on_deref=true, @run_now=nil, @freeze_on_deref=nil, @executor=#<Concurrent::SafeTaskExecutor:0x5d7f405b @task=#<Proc:0x2359d488@/usr/share/logstash/logstash-core/lib/logstash/instrument/periodic_poller/base.rb:52>, @exception_class=StandardError>, @StopEvent=#<Concurrent::Event:0x1c0e432a @set=false, @iteration=0>, @value=true, @copy_on_deref=nil, @dup_on_deref=nil>, @peak_threads=139, @peak_open_fds=248, @metric=#<LogStash::Instrument::Metric:0x64528846 @collector=#<LogStash::Instrument::Collector:0x1ca06c48 @agent=nil, @metric_store=#<LogStash::Instrument::MetricStore:0x4388710f @store=#<Concurrent::Map:0x2e181062 @default_proc=nil>, @structured_lookup_mutex=#<Mutex:0x40611198>, @fast_lookup=#<Concurrent::Map:0x3001b16a @default_proc=nil>>, @observer_state=true, @snapshot_task=#<Concurrent::TimerTask:0x3001be6c @observers=#<Concurrent::Collection::CopyOnNotifyObserverSet:0x5e74230e @observers={#<LogStash::Instrument::Collector:0x1ca06c48 ...>=>:update}>, @timeout_interval=600.0, @running=#<Concurrent::AtomicBoolean:0x2b667dd8>, @StoppedEvent=#<Concurrent::Event:0x16a94406 @set=false, @iteration=0>, @execution_interval=1.0, @do_nothing_on_deref=true, @run_now=nil, @freeze_on_deref=nil, @executor=#<Concurrent::SafeTaskExecutor:0x69ba698e @task=#<Proc:0x7751cd7c@/usr/share/logstash/logstash-core/lib/logstash/instrument/collector.rb:87>, @exception_class=StandardError>, @StopEvent=#<Concurrent::Event:0x6f41e6f3 @set=false, @iteration=0>, @value=false, @copy_on_deref=nil, @dup_on_deref=nil>>>, @options={:polling_interval=>1, :polling_timeout=>60}>, :result=>nil, :exception=>#<Concurrent::TimeoutError: Concurrent::TimeoutError>, :executed_at=>2016-09-20 15:33:34 +0200, :level=>:error}

I tried with 2.4 and 5, both with the congestion_threshold: same errors (EOF in FB and PeriodicPoller in LS -> they are not related to each other: the EOF is very frequent, the PeriodicPoller is rare).

Hmmm... without a network trace or logs giving any hints as to who is closing the connection, I can't really tell who is closing it. Seeing EOF in Beats is a sign of LS closing the connection, potentially due to LS being blocked or overloaded. Maybe I can get some input from the LS devs.

In my case LS is not writing much in the logs: the last message I get is "Pipeline main started", with the exception of some very rare other messages (in which case I get the error). The EOF, instead, is anything but rare.

This is the latest LS log:

{:timestamp=>"2016-09-22T14:59:32.705000+0200", :message=>"Pipeline main started"}
{:timestamp=>"2016-09-22T15:00:24.072000+0200", :message=>"Elapsed, 'end event' received", :end_tag=>"log1_start_finished", :unique_id_field=>"thread_id", :level=>:info}
{:timestamp=>"2016-09-22T15:00:31.200000+0200", :message=>"Elapsed, 'start event' received", :start_tag=>"log1_start_called", :unique_id_field=>"thread_id", :level=>:info}
{:timestamp=>"2016-09-22T15:00:34.041000+0200", :message=>"Elapsed, 'end event' received", :end_tag=>"log1_start_finished", :unique_id_field=>"thread_id", :level=>:info}
{:timestamp=>"2016-09-22T15:01:02.498000+0200", :message=>"Elapsed, 'start event' received", :start_tag=>"log1_start_called", :unique_id_field=>"thread_id", :level=>:info}

This is an extract of my FB log... it's much longer:

...
2016-09-22T15:02:53+02:00 ERR Failed to publish events caused by: EOF
2016-09-22T15:02:53+02:00 INFO Error publishing events (retrying): EOF
2016-09-22T15:02:56+02:00 INFO Non-zero metrics in the last 30s: registrar.states.update=17552 libbeat.logstash.published_but_not_acked_events=744 publish.events=17552 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=339206 libbeat.logstash.publish.read_bytes=30 libbeat.publisher.published_events=9254 libbeat.logstash.call_count.PublishEvents=4 libbeat.logstash.published_and_acked_events=10488 registrar.writes=3
...
2016-09-22T15:04:56+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=24 libbeat.logstash.published_and_acked_events=12719 registrar.states.update=19813 filebeat.harvester.started=2 libbeat.logstash.call_count.PublishEvents=4 filebeat.harvester.running=2 registrar.writes=3 libbeat.publisher.published_events=12719 libbeat.logstash.publish.write_bytes=452110 filebeat.harvester.open_files=2 publish.events=19813
2016-09-22T15:04:56+02:00 ERR Failed to publish events caused by: EOF
2016-09-22T15:04:56+02:00 INFO Error publishing events (retrying): EOF
2016-09-22T15:05:26+02:00 INFO Non-zero metrics in the last 30s: registrar.writes=3 libbeat.logstash.publish.read_bytes=48 libbeat.logstash.published_but_not_acked_events=1749 registrar.states.update=10793 libbeat.logstash.call_count.PublishEvents=5 libbeat.logstash.published_and_acked_events=10567 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=465925 libbeat.publisher.published_events=10567 publish.events=10793 registar.states.current=2
...
2016-09-22T15:07:26+02:00 INFO Non-zero metrics in the last 30s: registrar.writes=3 filebeat.harvester.started=1 publish.events=9254 registrar.states.update=9254 libbeat.logstash.publish.write_bytes=85514 libbeat.logstash.published_and_acked_events=1758 filebeat.harvester.closed=1 libbeat.logstash.publish.read_bytes=18 libbeat.logstash.call_count.PublishEvents=3 libbeat.publisher.published_events=1758
2016-09-22T15:07:26+02:00 ERR Failed to publish events caused by: EOF
2016-09-22T15:07:26+02:00 INFO Error publishing events (retrying): EOF
2016-09-22T15:07:56+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=24 libbeat.publisher.published_events=442 libbeat.logstash.published_and_acked_events=442 registrar.states.update=503 libbeat.logstash.publish.read_errors=1 libbeat.logstash.published_but_not_acked_events=7 libbeat.logstash.call_count.PublishEvents=4 publish.events=503 registrar.writes=3 libbeat.logstash.publish.write_bytes=20048
...
2016-09-22T15:10:26+02:00 INFO Non-zero metrics in the last 30s: filebeat.harvester.running=1 libbeat.logstash.call_count.PublishEvents=3 libbeat.publisher.published_events=815 libbeat.logstash.publish.write_bytes=35454 registrar.writes=3 libbeat.logstash.published_and_acked_events=815 publish.events=1385 libbeat.logstash.publish.read_bytes=18 registrar.states.update=1385 filebeat.harvester.open_files=1 filebeat.harvester.started=1
2016-09-22T15:10:35+02:00 ERR Failed to publish events caused by: EOF
2016-09-22T15:10:35+02:00 INFO Error publishing events (retrying): EOF
...

I've done some investigation here, and the EOF is due to how we handle an idle writer; in this case, the writer is FB. This feature is used to close an idle or latent connection after a delay, the default being 15s; this corresponds to the delay you see in your log between each event.

But there is a catch: if Logstash takes too much time to process a batch, a client timeout could occur because the input would not read from the socket. It might be related to the batch size you currently have configured.

I'll create a PR to fix that problem, to make sure that if a timeout occurs while Logstash is still processing a batch it won't close the connection. Until the fix is released you can bump client_inactivity_timeout to a bigger value like 900 (15 minutes).
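As a stop-gap, the beats input would then look something like this (a sketch only; the port matches the one used above):

input {
  beats {
    port => 5044
    # raise the idle timeout (in seconds) so slow batches don't get their connection closed
    client_inactivity_timeout => 900
  }
}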

No events should be lost in that case, since FB will retransmit the complete window, but it can slow down the consumption of events.

A few notes on things I saw in this issue:

  • I would not change pipeline.batch.size; the default is 125, and we have found that this is a sweet spot for performance. Using a value of 15000 for the batch means you will use a lot more memory than needed (see the sketch after this list).
  • congestion_threshold is deprecated; when you start Logstash you should see a warning about that.
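For reference, a minimal logstash.yml sketch with the batch size kept at the default; the worker count here is purely illustrative and should be tuned to your hardware:

pipeline.workers: 8        # illustrative; often around the number of CPU cores
pipeline.batch.size: 125   # Logstash default; larger batches use more memory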

@il.bert I've created this https://github.com/logstash-plugins/logstash-input-beats/issues/141 to track it.

Thanks a lot for the clarification Pier, I will wait for the solution. Any idea when it will be out?
In the meantime I will try client_inactivity_timeout.

Actually, it was Steffen who suggested using congestion_threshold up there, but it did not solve the problem.

I used such a large value because right now my log messages are tremendously small; I will try 125 and see if it works better, thanks.

@il.bert I would stay with the defaults as much as you can; we do our best to provide a good experience out of the box.

The number of workers depends on many factors: what kind of work you are doing and what machines you have. You have to do some experimentation to find the appropriate number for you.

I will try to release a fix really soon; since it's a plugin you will be able to update it without upgrading the whole Logstash.

With a value of 125 I am getting a very small disk throughput (<1MB) on the Elasticsearch nodes; moving to a larger batch (15k) instead gave me a throughput of 10MB on average, with peaks of 100MB.
My log lines are very small, and I read in the guide that bulks of 5-15MB are preferred, that's why I increased it :slight_smile:

BTW thanks, I'll wait for the update.

@il.bert you are right that a bulk size of 5-15MB is ideal; we have work planned to optimize the ES output to use the event size to optimize the bulk request.

But even if you use 15k, the ES output will use a max of 500 items per bulk request. So if you have really tiny logs it makes sense in your case to adjust the batch size, but you may also want to adjust the flush_size of the elasticsearch output, see https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-flush_size
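For example, an elasticsearch output with a larger flush_size could be sketched like this (the host address and the 5000 value are placeholders, not a recommendation):

output {
  elasticsearch {
    hosts => ["10.0.0.1:9200"]   # placeholder ES node address
    flush_size => 5000           # events per bulk request; the default is 500
  }
}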

My flush size is 30k :wink: I tried pipeline batch sizes from the default 125 up to 100k.
BTW, now I changed to 1250; the average disk throughput is a little better (around 5MB). I will try to put more load on it with this setting, but 125 is definitely too low.


Perfect, I was missing your logstash config :wink:

I am seeing a similar error. I am using the latest filebeat-linux-arm build on a Raspberry Pi.

2016/09/22 18:50:56.453871 output.go:109: DBG  output worker: publish 156 events
2016/09/22 18:50:56.453981 sync.go:53: DBG  connect
2016/09/22 18:50:56.456413 sync.go:107: DBG  Try to publish 156 events to logstash with window size 10
2016/09/22 18:50:56.461644 client.go:194: DBG  handle error: EOF
2016/09/22 18:50:56.461826 client.go:110: DBG  closing
2016/09/22 18:50:56.462080 sync.go:78: DBG  0 events out of 156 events sent to logstash. Continue sending
2016/09/22 18:50:56.462213 sync.go:58: DBG  close connection
2016/09/22 18:50:56.462393 sync.go:85: ERR Failed to publish events caused by: EOF
2016/09/22 18:50:56.462511 single.go:91: INFO Error publishing events (retrying): EOF
2016/09/22 18:50:56.462605 sync.go:58: DBG  close connection

I don't think Logstash is to blame, because it succeeds with an older build: filebeat version 1.2.0-SNAPSHOT (arm).

@pixelrebel please start another discussion. Add details like Logstash logs and the versions being used. Does the EOF happen only from time to time, or right from the beginning?