Filebeat EOF Error


(Steffen Siering) #6

Which Logstash version are you testing with?

The Logstash beats input plugin has a config option named congestion_threshold, set to 5 seconds by default. If the Logstash pipeline cannot process data for client X within this duration, the connection to client X is closed and the circuit-breaker will not accept connections for some time. With multiple beats connecting, I recommend effectively disabling the threshold by setting it to multiple days. The new Java rewrite of the input should have the circuit-breaker removed.
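
A minimal sketch of that setting in the beats input config (the port and value here are illustrative, not a recommendation):

input {
  beats {
    port => 5044
    # congestion_threshold is in seconds; a very large value effectively
    # disables the circuit-breaker on plugin versions that still have it
    congestion_threshold => 86400
  }
}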


(Michele) #7

I'm working with LS alpha 5 and 2.4 (the latest stable); both gave the same error, but the alpha allowed me to connect more machines.

I tried with 3600, but I am still getting the same error. On top of that, I now see i/o timeout errors on the FB side, and LS prints this error:

{:timestamp=>"2016-09-20T15:33:34.077000+0200", :message=>"PeriodicPoller: exception", :poller=>#<LogStash::Instrument::PeriodicPoller::JVM:0x15e1af9f @task=#<Concurrent::TimerTask:0x45e557e2 @observers=#<Concurrent::Collection::CopyOnNotifyObserverSet:0x5f4f22c8 @observers={#<LogStash::Instrument::PeriodicPoller::JVM:0x15e1af9f ...>=>:update}>, @timeout_interval=60.0, @running=#<Concurrent::AtomicBoolean:0x14357ba2>, @StoppedEvent=#<Concurrent::Event:0x7b806481 @set=false, @iteration=0>, @execution_interval=1.0, @do_nothing_on_deref=true, @run_now=nil, @freeze_on_deref=nil, @executor=#<Concurrent::SafeTaskExecutor:0x5d7f405b @task=#<Proc:0x2359d488@/usr/share/logstash/logstash-core/lib/logstash/instrument/periodic_poller/base.rb:52>, @exception_class=StandardError>, @StopEvent=#<Concurrent::Event:0x1c0e432a @set=false, @iteration=0>, @value=true, @copy_on_deref=nil, @dup_on_deref=nil>, @peak_threads=139, @peak_open_fds=248, @metric=#<LogStash::Instrument::Metric:0x64528846 @collector=#<LogStash::Instrument::Collector:0x1ca06c48 @agent=nil, @metric_store=#<LogStash::Instrument::MetricStore:0x4388710f @store=#<Concurrent::Map:0x2e181062 @default_proc=nil>, @structured_lookup_mutex=#<Mutex:0x40611198>, @fast_lookup=#<Concurrent::Map:0x3001b16a @default_proc=nil>>, @observer_state=true, @snapshot_task=#<Concurrent::TimerTask:0x3001be6c @observers=#<Concurrent::Collection::CopyOnNotifyObserverSet:0x5e74230e @observers={#<LogStash::Instrument::Collector:0x1ca06c48 ...>=>:update}>, @timeout_interval=600.0, @running=#<Concurrent::AtomicBoolean:0x2b667dd8>, @StoppedEvent=#<Concurrent::Event:0x16a94406 @set=false, @iteration=0>, @execution_interval=1.0, @do_nothing_on_deref=true, @run_now=nil, @freeze_on_deref=nil, @executor=#<Concurrent::SafeTaskExecutor:0x69ba698e @task=#<Proc:0x7751cd7c@/usr/share/logstash/logstash-core/lib/logstash/instrument/collector.rb:87>, @exception_class=StandardError>, @StopEvent=#<Concurrent::Event:0x6f41e6f3 @set=false, @iteration=0>, @value=false, @copy_on_deref=nil, @dup_on_deref=nil>>>, @options={:polling_interval=>1, :polling_timeout=>60}>, :result=>nil, :exception=>#<Concurrent::TimeoutError: Concurrent::TimeoutError>, :executed_at=>2016-09-20 15:33:34 +0200, :level=>:error}

(Michele) #8

I tried with 2.4 and 5, both with congestion_threshold set: same errors (EOF in FB and PeriodicPoller in LS). They don't seem related, though: the EOF is very frequent while the PeriodicPoller error is rare.


(Steffen Siering) #9

Hmmm... without a network trace or logs giving any hints, I can't really tell who is closing the connection. Seeing EOF in Beats is a sign of LS closing the connection, potentially due to LS being blocked or overloaded. Maybe I can get some input from the LS devs.


(Michele) #10

In my case LS is not writing much to its logs. The last message I get is "Pipeline main started", with the exception of some very rare other messages (in which case I get the error). EOF, on the other hand, is anything but rare.

This is the latest LS log:

{:timestamp=>"2016-09-22T14:59:32.705000+0200", :message=>"Pipeline main started"}
{:timestamp=>"2016-09-22T15:00:24.072000+0200", :message=>"Elapsed, 'end event' received", :end_tag=>"log1_start_finished", :unique_id_field=>"thread_id", :level=>:info}
{:timestamp=>"2016-09-22T15:00:31.200000+0200", :message=>"Elapsed, 'start event' received", :start_tag=>"log1_start_called", :unique_id_field=>"thread_id", :level=>:info}
{:timestamp=>"2016-09-22T15:00:34.041000+0200", :message=>"Elapsed, 'end event' received", :end_tag=>"log1_start_finished", :unique_id_field=>"thread_id", :level=>:info}
{:timestamp=>"2016-09-22T15:01:02.498000+0200", :message=>"Elapsed, 'start event' received", :start_tag=>"log1_start_called", :unique_id_field=>"thread_id", :level=>:info}

This is an extract of my FB log... it's much longer:

...
2016-09-22T15:02:53+02:00 ERR Failed to publish events caused by: EOF
2016-09-22T15:02:53+02:00 INFO Error publishing events (retrying): EOF
2016-09-22T15:02:56+02:00 INFO Non-zero metrics in the last 30s: registrar.states.update=17552 libbeat.logstash.published_but_not_acked_events=744 publish.events=17552 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=339206 libbeat.logstash.publish.read_bytes=30 libbeat.publisher.published_events=9254 libbeat.logstash.call_count.PublishEvents=4 libbeat.logstash.published_and_acked_events=10488 registrar.writes=3
...
2016-09-22T15:04:56+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=24 libbeat.logstash.published_and_acked_events=12719 registrar.states.update=19813 filebeat.harvester.started=2 libbeat.logstash.call_count.PublishEvents=4 filebeat.harvester.running=2 registrar.writes=3 libbeat.publisher.published_events=12719 libbeat.logstash.publish.write_bytes=452110 filebeat.harvester.open_files=2 publish.events=19813
2016-09-22T15:04:56+02:00 ERR Failed to publish events caused by: EOF
2016-09-22T15:04:56+02:00 INFO Error publishing events (retrying): EOF
2016-09-22T15:05:26+02:00 INFO Non-zero metrics in the last 30s: registrar.writes=3 libbeat.logstash.publish.read_bytes=48 libbeat.logstash.published_but_not_acked_events=1749 registrar.states.update=10793 libbeat.logstash.call_count.PublishEvents=5 libbeat.logstash.published_and_acked_events=10567 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.write_bytes=465925 libbeat.publisher.published_events=10567 publish.events=10793 registar.states.current=2
...
2016-09-22T15:07:26+02:00 INFO Non-zero metrics in the last 30s: registrar.writes=3 filebeat.harvester.started=1 publish.events=9254 registrar.states.update=9254 libbeat.logstash.publish.write_bytes=85514 libbeat.logstash.published_and_acked_events=1758 filebeat.harvester.closed=1 libbeat.logstash.publish.read_bytes=18 libbeat.logstash.call_count.PublishEvents=3 libbeat.publisher.published_events=1758
2016-09-22T15:07:26+02:00 ERR Failed to publish events caused by: EOF
2016-09-22T15:07:26+02:00 INFO Error publishing events (retrying): EOF
2016-09-22T15:07:56+02:00 INFO Non-zero metrics in the last 30s: libbeat.logstash.publish.read_bytes=24 libbeat.publisher.published_events=442 libbeat.logstash.published_and_acked_events=442 registrar.states.update=503 libbeat.logstash.publish.read_errors=1 libbeat.logstash.published_but_not_acked_events=7 libbeat.logstash.call_count.PublishEvents=4 publish.events=503 registrar.writes=3 libbeat.logstash.publish.write_bytes=20048
...
2016-09-22T15:10:26+02:00 INFO Non-zero metrics in the last 30s: filebeat.harvester.running=1 libbeat.logstash.call_count.PublishEvents=3 libbeat.publisher.published_events=815 libbeat.logstash.publish.write_bytes=35454 registrar.writes=3 libbeat.logstash.published_and_acked_events=815 publish.events=1385 libbeat.logstash.publish.read_bytes=18 registrar.states.update=1385 filebeat.harvester.open_files=1 filebeat.harvester.started=1
2016-09-22T15:10:35+02:00 ERR Failed to publish events caused by: EOF
2016-09-22T15:10:35+02:00 INFO Error publishing events (retrying): EOF
...

(Pier Hugues Pellerin) #11

I've done some investigation here, and the EOF is due to how we handle an idle writer; in this case, the writer is FB. This feature is used to close idle or latent connections after a delay, 15s by default; this corresponds to the delay you see in your log between each event.

But there is a catch: if Logstash takes too much time to process a batch, a client timeout can occur because the input will not read from the socket. It might be related to the size of the batch you currently have configured.

I'll create a PR to fix that problem, making sure that if a timeout occurs while Logstash is still processing a batch, it won't close the connection. Until the fix is released, you can bump client_inactivity_timeout to a bigger value like 900 (15 minutes).

No events should be lost in that case since FB will retransmit the complete window, but it can slow down the consumption of events.
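
As a sketch, the interim workaround in the beats input would look like this (the port is illustrative):

input {
  beats {
    port => 5044
    # give the pipeline time to drain a slow batch before the input
    # treats the client as idle and closes the connection
    client_inactivity_timeout => 900
  }
}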

A few notes on things I saw in this issue:

  • I would not change pipeline.batch.size; the default is 125, and we have found that it is a sweet spot for performance. Using a value of 15000 for the batch means that you will use a lot more memory than needed (see the sketch after this list).
  • congestion_threshold is deprecated; when you start Logstash you should see a warning about that.
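
For reference, a minimal sketch of where the batch size is set in Logstash 5.x (the value shown is the default):

# logstash.yml
pipeline.batch.size: 125

or, equivalently, on the command line: bin/logstash -b 125 -f your-pipeline.conf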

(Pier Hugues Pellerin) #12

@il.bert I've created this https://github.com/logstash-plugins/logstash-input-beats/issues/141 to track it.


(Michele) #13

Thanks a lot for the clarification Pier, I will wait for the fix. Any idea when it will be out?
In the meantime I will try with client_inactivity_timeout.

Actually, it was Steffen suggesting to use congestion_threshold up there, but it did not solve the problem.

I used such a large value because right now my log messages are tremendously small. I will try with 125 to see if it works better, thanks.


(Pier Hugues Pellerin) #14

@il.bert I would stay with the defaults as much as you can; we do our best to provide a good experience out of the box.

The number of workers depends on many factors: what kind of work you are doing and what machine you have. You have to do some experimentation to find the appropriate number for you.

I will try to release a fix really soon. Since it's a plugin, you will be able to update it without upgrading Logstash itself.
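
As a sketch, updating just the input plugin once the fix is out would look like this (the command prefix differs by version):

# Logstash 5.x
bin/logstash-plugin update logstash-input-beats

# Logstash 2.4
bin/plugin update logstash-input-beats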


(Michele) #15

With a value of 125 I am getting very small disk throughput (<1MB) on the Elasticsearch nodes, whereas moving to a larger batch (15k) gave me a throughput of 10MB on average with peaks of 100MB.
My log lines are very small, and I read in the guide that bulks of 5-15MB are preferred; that's why I increased it :slight_smile:

BTW thanks, I'll wait for the update.


(Pier Hugues Pellerin) #16

@il.bert You are right that a bulk size of 5-15MB is ideal; we have work planned to make the ES output use the event size to optimize the bulk request.

But even if you use 15k, the ES output will use a maximum of 500 items for each bulk request. So if you have really tiny logs, it makes sense in your case to adjust the batch size, but you may also want to adjust the flush_size of the elasticsearch output; see https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-flush_size
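
As a sketch, raising it in the elasticsearch output looks like this (the host and value are illustrative):

output {
  elasticsearch {
    hosts => ["es-node1:9200"]
    # flush_size caps the number of events per bulk request (default 500)
    flush_size => 5000
  }
}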


(Michele) #17

My flush size is 30k :wink: I tried pipeline batch sizes from the default 125 all the way up to 100k.
BTW, I have now changed to 1250 and the average disk throughput is a little better (around 5MB). I will try to put more load on this one, but 125 is definitely too low.


(Pier Hugues Pellerin) #18

Perfect, I was missing your logstash config :wink:


#19

I am seeing a similar error. I am using the latest filebeat-linux-arm build on a Raspberry Pi.

2016/09/22 18:50:56.453871 output.go:109: DBG  output worker: publish 156 events
2016/09/22 18:50:56.453981 sync.go:53: DBG  connect
2016/09/22 18:50:56.456413 sync.go:107: DBG  Try to publish 156 events to logstash with window size 10
2016/09/22 18:50:56.461644 client.go:194: DBG  handle error: EOF
2016/09/22 18:50:56.461826 client.go:110: DBG  closing
2016/09/22 18:50:56.462080 sync.go:78: DBG  0 events out of 156 events sent to logstash. Continue sending
2016/09/22 18:50:56.462213 sync.go:58: DBG  close connection
2016/09/22 18:50:56.462393 sync.go:85: ERR Failed to publish events caused by: EOF
2016/09/22 18:50:56.462511 single.go:91: INFO Error publishing events (retrying): EOF
2016/09/22 18:50:56.462605 sync.go:58: DBG  close connection

I don't think logstash is to blame because it succeeds with an older build: filebeat version 1.2.0-SNAPSHOT (arm)


(Steffen Siering) #20

@pixelrebel Please start another discussion. Add details like the Logstash logs and the versions being used. Does the EOF happen only from time to time, or right from the beginning?


(Maxwell Flanders) #21

Hey guys, I found that the root of our issue was actually a misconfigured filter on the Logstash end that was clogging the filter workers' queue. I guess that eventually manifested itself in Filebeat's logs.

Edit: Posted this in the wrong thread. Copying it to the correct thread that I started: Missing a lot of logs sent via filebeats


(Steffen Siering) #22

It's a combination of filters/outputs in Logstash slowing down the inputs, plus the beats input closing the connection when this happens.

Related github issue for most recent LS beats input plugin: https://github.com/logstash-plugins/logstash-input-beats/issues/141

Workarounds: older versions require you to increase congestion_threshold, newer versions require you to increase client_inactivity_timeout. In either case it's a good idea to try to figure out the reason for the slowdowns in LS, such as expensive filters.


(Maxwell Flanders) #23

Yeah, mine was a case where we had a DNS filter with the wrong DNS IPs in it, so it was doomed to fail. However, in cases where legitimate throughput is just so heavy that it slows you down, these settings could still definitely help. Would increasing the number of Logstash workers or Filebeat workers be expected to help in a situation like that?


(Steffen Siering) #24

There are a number of options in Filebeat to increase throughput (some at the cost of increased memory usage).

For example, filebeat.publish_async and output.logstash.pipelined. Adding workers and enabling load balancing can increase throughput as well (depending on the mix of workers, filebeat.spool_size and output.logstash.bulk_max_size).
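
A sketch of those knobs in filebeat.yml (Filebeat 5.x layout; hosts and values are illustrative, not recommendations):

filebeat.publish_async: true        # async publishing, trades memory for throughput
filebeat.spool_size: 4096           # events buffered before the spooler flushes

output.logstash:
  hosts: ["ls1:5044", "ls2:5044", "ls3:5044"]
  loadbalance: true                 # spread batches across the hosts above
  worker: 2                         # workers per configured host
  bulk_max_size: 2048               # max events per batch sent to Logstash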


(Michele) #25

Hi, sadly our tests did not go well; none of the above workarounds worked 100%, and we were still getting the error.
As we had the opportunity, we moved to a larger cluster of 3 HUGE machines. We expected one of them to be able to handle the whole load, but we used 3 for the test.

It happened again. EOF

Thus it could not be a problem of fully loaded machines, so we moved on to changing other parameters.
We found that removing load balancing in Filebeat solved the problem, or at least it seems to.

In other words, using only a single Logstash output host per Filebeat works well (we split our sources into 3 groups and each group goes to one LS machine). We have been running this configuration for 24h and it's working pretty well.
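
As a sketch, the per-group Filebeat output looks roughly like this (hosts are illustrative); the machines in groups 2 and 3 point at ls2 and ls3 instead:

output.logstash:
  hosts: ["ls1:5044"]
  # loadbalance defaults to false; each Filebeat now talks to exactly one LS
  loadbalance: false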

This is not a good solution for us: we have a cluster of machines and we want it to work efficiently and reliably.
Do you have any idea why this is happening? How can we solve it?

thanks

