Filebeat and busy files

I've noticed a problem with log shipping from heavily written files:

2016-07-18T13:15:24+02:00 DBG Not harvesting, file didn't change: /var/log/app.log
2016-07-18T13:15:24+02:00 DBG Update existing file for harvesting: /var/log/app.log

but after a few seconds filebeat sends a few logs from 30 minutes earlier. Then again "Not harvesting, file didn't change", and after a few seconds it again sends some logs.
I've tried removing the registry file and using the tail_files option, and after ~15 seconds of normal log shipping filebeat again sends logs lazily, as above.
Filebeat 1.2.3 on CentOS 7, installed from the official RPM package.

Today I tested the latest nightly build from 2016-07-19T14:18:17.000Z and the problem still occurs. The current time spread is 20 minutes (plus the +2h timezone difference):

{"@timestamp":"2016-07-20T06:47:10.388Z","beat":{"hostname":"a.b","name":"a.b"},"input_type":"log","message":"2016-07-20 08:27:26,532 - [DEBUG] - from application (...)

Filebeat sends logs much more slowly than the app writes them to the file. Most events are not sent or have a spread of >10h. Here are the time spread statistics (the difference between the event send timestamp and the timestamp logged in the event):
https://s32.postimg.org/ov274p8z9/snapshot7.png

Those statistics are from a single file (5M lines logged daily), but I'm facing this problem with other log files on other machines in the environment. I have filebeat deployed on a testing env on ~200 machines and the problem occurs everywhere.

When I watch with strace what filebeat is doing, I see nothing special. Mostly many calls like:

futex(0xed9a68, FUTEX_WAIT, 0, NULL) = 0
select(0, NULL, NULL, NULL, {0, 100}) = 0 (Timeout)

The debug log shows nothing beyond what I wrote in the first post.

Can anyone help here? Is there any chance of a fix, or should I go back to beaver? :slight_smile:

Can you share your configuration? Are you sending data to elasticsearch directly or via logstash?

Thanks for testing the nightly build. There is a potential issue in 1.2.3 which causes reading to stop, but this should be fixed in the nightly build.

Configuration is shipped using Puppet, so it is the same on all hosts:

filebeat.yml
http://pastebin.com/PEuyJR1C

conf.d/cloudapp.yml
http://pastebin.com/9tRKPaLf

Have you tried checking how fast filebeat processes files when publishing to a local file or to the console (with stdout redirected to /dev/null)?

Normally filebeat can read files quite fast, even more so thanks to the OS file cache. But filebeat also experiences back-pressure from Logstash, which slows down reading (content is kept in the files, not in RAM).
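
For such a test, a minimal sketch of the output section could look like this (the path and filename are placeholders; option names assume the 1.x file output):

output:
  file:
    # write events to a local file instead of Logstash, to measure pure read/publish speed
    path: "/tmp/filebeat-test"
    filename: filebeat.out

Comparing the event rate of such a run with the rate when publishing to Logstash separates filebeat's read speed from output back-pressure.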

That's a good idea. In fact I have Logstash split into three layers (cache, index and writer) with redis queues in between. I made two more tests:

  1. I turned off filebeat on all nodes in the env except one, where I set a prospector for this single log. Events were not sent any quicker than before. After turning it back on on all hosts, filebeat flooded the Logstash cache instances with old logs (reported here: Filebeat sends old logs after reboot) and the redis queues were growing, so the Logstash cache instances can handle a much bigger load than normal (a single instance about 10k events/s).

  2. I set the output in filebeat to file only - now filebeat works perfectly, at 5700 events/s. In fact I have many Java apps on this machine, the load is ~30, and CPU usage is about 50%. Beaver is also running here, sending logs to another Logstash stack (I'm testing filebeat with a new Logstash configuration), and it has no such problems. On the other hand, logs from syslog and other "normal" services are sent by filebeat without issues, but those files don't get nearly as many events. It's not a problem caused by the Java apps themselves, because I have a VM which is a reverse-proxy server with nginx only, the load there is not high, and the situation with the nginx access log is similar.

I set the output in filebeat to file only - now filebeat works perfectly, at 5700 events/s

That's good. This indicates a problem with the network or with Logstash. Which Logstash version have you installed? There are known issues with the beats input in Logstash being slow when handling the protocol. Some background information is in this post.

In fact I have many Java apps on this machine, the load is ~30, and CPU usage is about 50%.

I don't understand - is this a problem? Is the CPU usage high due to filebeat processing old logs on start?

On the other hand, logs from syslog and other "normal" services are sent by filebeat without issues, but those files don't get nearly as many events

Which log files? Is filebeat missing log lines? Did you delete the registry file between tests?

[quote="steffens, post:7, topic:55751"]
That's good. This indicates a problem with the network or with Logstash. Which Logstash version have you installed? There are known issues with the beats input in Logstash being slow when handling the protocol. Some background information is in this post.[/quote]
I'm using Logstash version 2.3.4-1 (the official RPM package). It could be a problem with the beats input, but it looks a bit strange... some files from the same host have their logs stored in Elasticsearch quickly, while some logs (mostly the busy ones) are stored with a huge lag.
I don't see network problems... request times from the Java apps on this host are normal, and beaver, which also runs on this host, ships log events quickly.

For me that's not a problem; the Java apps generate that load. I only see a correlation between the load on the host and the lag in log events sent by filebeat. The old-logs flood is a separate problem.

I didn't say that filebeat is missing log lines.

yes

@lkosz Are all the follow-up tests above with the nightly build or with 1.2.3?

@ruflin Both: 1.2.3 is deployed on the whole env, and the nightly from 2016-07-19T14:18:17.000Z I've tested on a single node.

Could you share some filebeat log files with at least info level (debug level would be best)? Then we could see whether back pressure is coming from Logstash. Also, can you check whether you have the most recent version of the beats input plugin?

logstash config:
http://pastebin.com/7BEUSu1G

logstash seems to operate normally

filebeat logs:
http://pastebin.com/LXEuiQSJ
(I removed many lines with 'publish' and all with document content)
Single publish event looks like this:
http://pastebin.com/7uMZNy70

Regarding versions... logstash-plugin doesn't seem to provide a way to show the version. By searching the filesystem I found two versions, 2.1.3 and 2.2.9:
http://pastebin.com/teKwQ1ik
/opt/logstash is managed by the RPM; I don't make any manual changes.

I see the window size still growing in the filebeat logs; that is, we're still seeing the slow-start here. Increase bulk_max_size to 2048 in the logstash output (due to a bug its default value is 1024). I didn't check the timestamps of the ACK messages in the logs, but those might indicate the latency with which Logstash ACKs events.
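
For illustration, only the relevant part of the filebeat config needs to change (everything else stays as in your existing config):

output:
  logstash:
    # raise the per-batch size; the default is 1024 due to the bug mentioned above
    bulk_max_size: 2048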

Given your redis output config in Logstash, I'd recommend giving the redis output in filebeat 5.0 alpha4 a try. It also supports load balancing, SSL (use with stunnel, for example) and SOCKS5 proxy support. (Do not use the redis output with filebeat 1.x releases.)
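
A rough sketch of what the 5.0 alpha config could look like (the hosts and key below are placeholders, not taken from your setup):

output.redis:
  # placeholder hosts; point these at the redis cache layer
  hosts: ["redis1.z:6379", "redis2.z:6379"]
  key: "filebeat"
  loadbalance: true
  #proxy_url: "socks5://proxy.local:1080"   # optional SOCKS5 proxy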

Setting bulk_max_size to 2048 didn't help.

Yes, I know about that, but filebeat 5.0 is an alpha version, and I have another strange problem: filebeat sends many old log lines after a restart, which is also a blocker for me.

Let's concentrate on one issue at a time.

Did you try filebeat 5.0 alpha4, or are you still considering filebeat->logstash->redis->xxx?

Filebeat has send-at-least-once semantics. It keeps track of the lines ACKed by Logstash in its registry file. If filebeat is shut down before Logstash has ACKed the in-flight events, filebeat has to resend those events, as it cannot tell whether Logstash already processed those lines or dropped them (e.g. because the Logstash pipeline is congested). The same goes for the redis and elasticsearch outputs. See the related enhancement request. Waiting for output ACKs on shutdown might trigger other issues. Apparently no network protocol used by any of the outputs supports send-exactly-once semantics (I heard the Kafka team is planning to add exactly-once semantics to the protocol).

One potential workaround is to compute a unique id based on source, offset and timestamp (if the timestamp has been extracted from the log message by Logstash) and/or a hash of the log line. Given that computing the id for a given event always yields the same result, no duplicate events will end up in Elasticsearch.
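
A sketch of this idea on the Logstash indexer side; the fingerprint filter and the elasticsearch output's document_id option exist, but the chosen fields, key and host below are assumptions, not your actual config:

filter {
  fingerprint {
    # build a stable id from the originating file, offset and line content
    source => ["source", "offset", "message"]
    concatenate_sources => true
    method => "SHA1"
    key => "some-static-key"
    target => "[@metadata][fingerprint]"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    # resent duplicates get the same id and simply overwrite the existing document
    document_id => "%{[@metadata][fingerprint]}"
  }
}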

I tried the filebeat 5.0 nightly to check whether it has the same shipping problems; I didn't try it with direct-to-redis writing. I'm not tied to using Logstash in front of the ELK stack - I'm using this funny configuration because I have to: in filebeat 1.x there is no load-balanced redis output. I could build it with haproxy, but deploying haproxy alongside filebeat on every server is not a good idea.

Regarding the old-logs flood - I didn't notice any log duplication, but I only checked ~50 events and didn't test for long. What I see is a flood after a filebeat restart. When I do it on two or three machines it is hardly noticeable, because the 6 Logstash indexer instances easily handle this traffic; it can only be observed in the timestamp spread calculated for every event. The problem appears when I restart all filebeat instances at once (using mcollective). A small flood * 95 = a big flood, and a big flood + 95 * the normal heavy traffic = 8M docs in the redis queues within a few minutes :slight_smile:
The indexer drops logs which are too old for me (I keep 30 days of logs) or which are from the future (not every log has the year stored, and logs from last September come out as future dates after date parsing :slight_smile: ), so I have minimised this problem.

What's the filebeat config used with the nightly build?

Any results different from the original tests with the nightly build? I doubt it, as I think the performance problem is mostly in Logstash.

Which logstash version did you use?

How many Logstash instances taking input from beats do you have running? Adding more nodes, or adding more workers in beats with load balancing, might help a little. Given your original config, this is how you can try to send batches in parallel with 2 workers per host (a total of 6 batches in flight):

filebeat:
  spool_size: 12288

output:
  logstash:
    hosts:
      - "a.z:5044"
      - "b.z:5044"
      - "c.z:5044"
    loadbalance: true
    worker: 2
    bulk_max_size: 2048
    index: logstash

Notice how spool_size is bulk_max_size multiplied by the total number of output workers (12288 = 2048 * 6, with 6 = hosts * worker).
