Question about the idle_timeout option


#1

hi,
reading the documentation, i was thinking that the spooler is flushed every 'idle_timeout' duration or if the spool_size is reached.
In my case (as u can see on my log below), I set these options :

  • spool_size = 4096
  • idle_timeout = 1s

I start filebeat to prospect and harvest a file with more than 50 000 events.
when i read the filebeat log, i see that 4096 events are sent every 11 or 12 seconds ...
look :
2016-05-11T16:26:36+02:00 INFO Starting prospector of type: log
2016-05-11T16:26:36+02:00 INFO Starting spooler: spool_size: 4096; idle_timeout: 1s
2016-05-11T16:26:36+02:00 INFO All prospectors initialised with 0 states to persist
2016-05-11T16:26:36+02:00 INFO Starting Registrar
2016-05-11T16:26:36+02:00 INFO Start sending events to output
2016-05-11T16:26:36+02:00 INFO Harvester started for file: /var/log/logstash/logs_BAD/logs_PMM_PERF/PRDGCE7V3094/LogsPerf.txt.1
2016-05-11T16:26:50+02:00 INFO Events sent: 4096
2016-05-11T16:26:50+02:00 INFO Registry file updated. 1 states written.
2016-05-11T16:27:02+02:00 INFO Events sent: 4096
2016-05-11T16:27:02+02:00 INFO Registry file updated. 1 states written.
2016-05-11T16:27:13+02:00 INFO Events sent: 4096
2016-05-11T16:27:13+02:00 INFO Registry file updated. 1 states written.
2016-05-11T16:27:24+02:00 INFO Events sent: 4096
2016-05-11T16:27:24+02:00 INFO Registry file updated. 1 states written.
2016-05-11T16:27:35+02:00 INFO Events sent: 4096
2016-05-11T16:27:35+02:00 INFO Registry file updated. 1 states written.
2016-05-11T16:27:46+02:00 INFO Events sent: 4096
2016-05-11T16:27:46+02:00 INFO Registry file updated. 1 states written.
2016-05-11T16:27:58+02:00 INFO Events sent: 4096
2016-05-11T16:27:58+02:00 INFO Registry file updated. 1 states written.
2016-05-11T16:28:09+02:00 INFO Events sent: 4096
2016-05-11T16:28:09+02:00 INFO Registry file updated. 1 states written.
2016-05-11T16:28:21+02:00 INFO Events sent: 4096
2016-05-11T16:28:21+02:00 INFO Registry file updated. 1 states written.
2016-05-11T16:28:32+02:00 INFO Events sent: 4096
2016-05-11T16:28:32+02:00 INFO Registry file updated. 1 states written.
2016-05-11T16:28:43+02:00 INFO Events sent: 4096
2016-05-11T16:28:43+02:00 INFO Registry file updated. 1 states written.

why filebeat don't send events every 1s ? it seems like idle_timeout is not used ..
do i miss something ?
ty everybody :slight_smile:


(ruflin) #2

Can you share your config file, your filebeat version and if possible enabled the debug level on logging to get some more information?


#3

ty ruflin for answering.
In fact, i look at the debug-mode log and understand than the sppol-size is filled quickly than the idle_timeout.
so, there is no problem :slight_smile:
I have one more question now :
i read this in the log :
2016-05-17T15:09:13+02:00 DBG Try to publish 2048 events to logstash with window size 10
2016-05-17T15:09:14+02:00 DBG Try to publish 2038 events to logstash with window size 15
2016-05-17T15:09:14+02:00 DBG Try to publish 2023 events to logstash with window size 23
2016-05-17T15:09:14+02:00 DBG Try to publish 2000 events to logstash with window size 35
2016-05-17T15:09:15+02:00 DBG Try to publish 1965 events to logstash with window size 53
2016-05-17T15:09:15+02:00 DBG Try to publish 1912 events to logstash with window size 80
2016-05-17T15:09:16+02:00 DBG Try to publish 1832 events to logstash with window size 120
2016-05-17T15:09:16+02:00 DBG Try to publish 1712 events to logstash with window size 180
2016-05-17T15:09:17+02:00 DBG Try to publish 1532 events to logstash with window size 270
2016-05-17T15:09:18+02:00 DBG Try to publish 1262 events to logstash with window size 405
2016-05-17T15:09:19+02:00 DBG Try to publish 857 events to logstash with window size 608
2016-05-17T15:09:21+02:00 DBG Try to publish 249 events to logstash with window size 857
2016-05-17T15:09:22+02:00 DBG Try to publish 2048 events to logstash with window size 857
2016-05-17T15:09:25+02:00 DBG Try to publish 1191 events to logstash with window size 1024
2016-05-17T15:09:28+02:00 DBG Try to publish 167 events to logstash with window size 1024
2016-05-17T15:09:30+02:00 DBG Try to publish 2048 events to logstash with window size 1024
2016-05-17T15:09:33+02:00 DBG Try to publish 1024 events to logstash with window size 1024
2016-05-17T15:09:36+02:00 DBG Try to publish 2048 events to logstash with window size 1024
2016-05-17T15:09:39+02:00 DBG Try to publish 1024 events to logstash with window size 1024
2016-05-17T15:09:42+02:00 DBG Try to publish 1808 events to logstash with window size 1024
2016-05-17T15:09:45+02:00 DBG Try to publish 784 events to logstash with window size 1024

what is this 'window size' ? I don't find it in the doc reference...


(ruflin) #4

The window size is adjusted dynamically in coordination with logstash. Checking the code the max windows size is based on bulk_max_size (https://github.com/elastic/beats/blob/e20d5ee952e6a04cbdb3033788379d226fb7d46b/libbeat/outputs/logstash/config.go#L14). Seems like it is not in the docs right now. @steffens Should we add this one or was it left out on purpose?

Here you can find some more details on the implementation of Window Size: https://github.com/elastic/beats/blob/e20d5ee952e6a04cbdb3033788379d226fb7d46b/libbeat/outputs/logstash/window.go


(Steffen Siering) #5

it's configured right here: https://www.elastic.co/guide/en/beats/filebeat/current/logstash-output.html#_bulk_max_size_2 . The actual window size can grow up to bulk_max_size.


#6

ty steffens.
it is said that the default value is 2048.
but, in my log, the max value of window_size is 1024 (and i haven't change the bulk_max_size in my config).
documentation is wrong ?
and how can you explain the growing window size at the beginning of the transfert ? from 10 to 1024 ?

ty very much for your reactivity, it's a pleasure :slight_smile:


(Steffen Siering) #7

which filebeat version have you installed? It was raised from 1024 to 2048 at some point (there was a bug having the value to be set to 1024 due to git automatically merging the default config back to 1024 without us noticing). Workaround is to set bulk_max_size: 2048 in your config. Sometimes 3K or 4K might even give better performance. The batches collected via spool_size will be split into sub-batches of size bulk_max_size. Having load balancing enabled, the sub-batches will be load-balanced on logstash workers (default: 1) operating in lock-step for filebeat. If filebeat.publish_async: true there is no need for spliting batches, as batches from spooler will be load-balanced anyway.

One problem is with logstash circuit breaker killing connections if processing batches takes too long. Effect with logstash forwarder was: resend the complete batch and get connection killed due to same batch still taking to much time. If this happens to you (even a good idea in general) is to the congestion_threshold in beats input plugin to some very high value. Beats itself has a connection timeout of 30s (increase as required). Unfortunately logstash can not send a keepalive (it was updated to send partial ACKs, but no real replacements for keepalives) for architectural reasons (we're working on this). Unfortunately batch processing depends on logstash + (even more so) downstream due to logstash blocking on back-pressure from output plugins (e.g. database can not store data fast enough). Due to missing keepalive signal and preventing beats from resending all events over and over again, beats do use a dynamic window (batches are split according to current window size) + slow start. If we find a particular batch size generates downstream problems (disconnects + resends), the window size is reduced again. On startup the default windows size (max active batch size) is 10 events, so we don't run into problems early on.


#8

ok, very complete informations.
my version is 1.2.1 but i'm not in production yet... i consider i'll take the 5.0.0 version but i don't know when it becomes the current version.. have u infos about that ?
ty again steffens


(Steffen Siering) #9

hm, weird 1.2.1 is still using 1024 as default. I created a ticket.

Back to your original question, filebeat (and potentially logstash) will suffer from back-pressure if downstream processing can not process events in time. In this case both services will buffer up some events, but will get blocked by downstream processing (filebeat never drops events). That is, the spooler output queue will block if output pipeline becomes congested. If spooler is blocked, it will batch up new events only after every successful send. Once the window size has reached 1024 events, it already takes between 2 and 3 seconds to get the ACK from logstash. Due to having the spooler_size set to 4096, 4 batches of size 1024 will be forwarded to logstash. With 4*3s=12s this matches your original observation of the spooler being flushed only every 11 to 12 seconds. Consider load-balancing by enabling publish_async or set logstash worker: 4.


#10

something strange steffens.
In 1.2.1, window size (and bulk_max_size) seems to be parameter by default to 1024 but, in the beginning of my log, it is reported that it is 2048. look :
1 2016-05-17T15:09:12+02:00 DBG Disable stderr logging
2 2016-05-17T15:09:12+02:00 DBG Initializing output plugins
3 2016-05-17T15:09:12+02:00 INFO GeoIP disabled: No paths were set under output.geoip.paths
4 2016-05-17T15:09:12+02:00 INFO Max Retries set to: 3
5 2016-05-17T15:09:12+02:00 DBG connect
6 2016-05-17T15:09:12+02:00 INFO Activated logstash as output plugin.
7 2016-05-17T15:09:12+02:00 DBG Create output worker
8 2016-05-17T15:09:12+02:00 DBG No output is defined to store the topology. The server fields might not be filled.
10 2016-05-17T15:09:12+02:00 INFO Flush Interval set to: 1s
11 2016-05-17T15:09:12+02:00 INFO **[quote="steffens, post:9, topic:50002, full:true"]
12 2016-05-17T15:09:12+02:00 DBG create bulk processing worker (interval=1s, bulk size=2048)
13 2016-05-17T15:09:12+02:00 INFO Init Beat: filebeat; Version: 1.2.1
14 2016-05-17T15:09:12+02:00 INFO filebeat sucessfully setup. Start running.
15 2016-05-17T15:09:12+02:00 INFO Registry file set to: /var/lib/filebeat/registry

later, you've this, as i show in my first reply :
65495 2016-05-17T15:09:25+02:00 DBG Try to publish 1191 events to logstash with window size 1024
65501 2016-05-17T15:09:28+02:00 DBG Try to publish 167 events to logstash with window size 1024
95818 2016-05-17T15:09:30+02:00 DBG Try to publish 2048 events to logstash with window size 1024
95828 2016-05-17T15:09:33+02:00 DBG Try to publish 1024 events to logstash with window size 1024
124508 2016-05-17T15:09:36+02:00 DBG Try to publish 2048 events to logstash with window size 1024
124513 2016-05-17T15:09:39+02:00 DBG Try to publish 1024 events to logstash with window size 1024
149835 2016-05-17T15:09:42+02:00 DBG Try to publish 1808 events to logstash with window size 1024
149845 2016-05-17T15:09:45+02:00 DBG Try to publish 784 events to logstash with window size 1024

not coherent isn't it ?


(Steffen Siering) #11

With default filebeat config you can ignore the bulk processing worker. The worker is part of the internal publisher pipeline doing event normalization (check errors, filter events). The bulk processing worker is kinda similar to the spooler in filebeat with flush timer (used by other beats) serving potentially multiple configured outputs. filebeat uses an option to bypass the bulk worker and forward events directly to configured outputs.

The default spool_size in filebeat is 2048, but the output will split up batches according to current window size to not "overload" logstash.

For example:

65495 2016-05-17T15:09:25+02:00 DBG Try to publish 1191 events to logstash with window size 1024
65501 2016-05-17T15:09:28+02:00 DBG Try to publish 167 events to logstash with window size 1024

the batch send by spooler has had 1191 events included. The window size is 1024. 1191 - 1024 = 167. The second message having 167 events is the second send attempt of the original batch (we do not drop events, so we have to split) including only those events not being send in first window. The timestamps indicate, processing first window took like 3 seconds (including send + porcessing in logstash + receive ACK). This is exactly what dynamic windowing is about, split one big request into multiple smaller ones.

The bug is, the default bulk_max_size in logstash output being 1024 by default in 1.x releases (it should be fixed for 5.0). See github issue I've posted. The output worker (unfortunately using default of 2048), will split the original batch from spooler (potentially load-balancing the split-up batches). Setting bulk_max_size, the same option will be used by output worker batch splitting and the logstash output window sizing.

If you configure bulk_max_size: 2048 (you really should try) and logstash can process in time (increase congest_threshold in beats input plugin - default of 5 seconds might be not enough) the window size will increase to 2048.


(system) #12