[filebeat] no registry updates and high memory usage after changing tail_files from true to false

shog · November 22, 2016, 7:15pm

Unfortunately I've noticed yet another problem with registry file updates in my setup:

Filebeat Version: 5.0.1

Steps to repro:

Start filebeat with tail_files: true to get initial state of files.

tail_files: true

At this point everything works fine.

After a longer moment stop Filebeat, change tail_files back to false to continue from that point.

tail_files: false


# ls -l /var/lib/filebeat/registry 
-rw-r--r-- 1 root root 504152 Nov 22 18:54 /var/lib/filebeat/registry

Observed problem: Filebeat continues delivering messages, but no registry update is ever made:

# curl -s http://localhost:6060/debug/vars | egrep "logstash|registr|harvester|prospector"
"filebeat.harvester.closed": 263,
"filebeat.harvester.files.truncated": 0,
"filebeat.harvester.open_files": 387,
"filebeat.harvester.running": 387,
"filebeat.harvester.skipped": 0,
"filebeat.harvester.started": 650,
"filebeat.prospector.log.files.renamed": 0,
"filebeat.prospector.log.files.truncated": 0,
"libbeat.logstash.call_count.PublishEvents": 432,
"libbeat.logstash.publish.read_bytes": 3408,
"libbeat.logstash.publish.read_errors": 0,
"libbeat.logstash.publish.write_bytes": 440049709,
"libbeat.logstash.publish.write_errors": 0,
"libbeat.logstash.published_and_acked_events": 719514,
"libbeat.logstash.published_but_not_acked_events": 0,
"registrar.states.cleanup": 0,
"registrar.states.update": 0,
"registrar.writes": 0

# date
Tue Nov 22 19:16:47 UTC 2016
# ls -l /var/lib/filebeat/registry 
-rw-r--r-- 1 root root 504152 Nov 22 18:54 /var/lib/filebeat/registry

END result: any subsequantial restart will trigger resending data collected from the point where registry files was updated last time (before restart in point 2))

Note: If I leave tail_false as true, I might restart filebeat without any problem and registry file WILL be updated as expected.

shog · November 22, 2016, 7:46pm

Additionally there is insane memory usage on boxes affected by lack of registry updates and filebeat process can grow up to several dozens of gigabytes of RES memory (until it's gonna be killed by OOM).

 1125 root      15  -5 1371m 1.3g 5048 S  5.6  2.8   8:47.67 filebeat

ruflin · November 23, 2016, 11:30am

Quickly tried to reproduce this with 5.0.1 and master but no success. What is your OS? Any chance to share some log files?

After starting filebeat with tail_files: true, did you add events and check if the registry had the correct entries?
Did you add events during the time filebeat was down?

Trying to find the exact steps you did to reproduce it.

shog · November 23, 2016, 2:38pm

hI @ruflin

UPDATE: So it looks like that change of tail_files from true to false isn't a hard requirement to trigger this, I already see some nodes that were left with "true" and failed to update registry after restart too. It's just happens less often on them.

There is nothing meaningful in FB logs,
It's CentOS6
All affected FB instances are still delivering events properly, despite growing memory consumption (they gonna be eventually killed by OOM). My biggest filebeat instance at the moment is at 22gb of mem used:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7259 root      15  -5 23.4g  22g 2548 S  4.6 36.4 195:38.01 filebeat

Memstats are as follows: https://paste.fedoraproject.org/488846/47991177/

I've rechecked and for sure there was no second FB instance spawned at any point (to avoid situation described in Init scripts do not respect shutdown_timeout · Issue #2951 · elastic/beats · GitHub )
It's probably also not a file ulimit problem (bumped to 50k and according to filebeat stats there are usually 0.5k of files open)

Yes, I'm running Filebeat on syslog receiver boxes so logs are constantly updated (with some amount of newly created file every once a while), including period when FB is down. It looks like registry file wasn't modified with any update after first restart at all.

shog · November 24, 2016, 2:57pm

hi @ruflin

One more UPDATE: Apparently even restart isn't required to trigger this bug. I already had more than occurrence of Filebeat instances that suddenly stopped updating registry file without any restart or second instance being spawned.

So the example timeline was as follows:
18:00 Filebeat started
00:00 Registry updates stopped,Filebeat began to consume more memory from that point.
Apart from lack of registry updates, nodes remained operational and were able to deliver events to logstash.
00:00 was also a time when some amount of files were created, but for sure it didn't hit open file limit.

So I'm wondering, is it possible that in some cases all of the registry updates are accumulated in memory instead of being written to disc? That'd explain both issues.

ruflin · November 24, 2016, 3:16pm

Thanks for the investigation. Could you share your full config?

ruflin · November 24, 2016, 3:54pm

Can you disable publish_async and check if you still see the issue?

shog · November 24, 2016, 4:24pm

So that's actually very interesting.

For ~10 minutes after restarting Filebeat with publish_async: false there are zero events delivered, but Filebeat's constantly changing registry file. Plus time needed to fully stop filebeat (as to wait until filebeat process shuts down) is also incredibly long (3-5 minutes?). Worth to note that rc1 never behaved that way.

Every 2.0s: curl -s http://localhost:6060/debug/vars | grep -v memstats                                                                                                                                                                      Thu Nov 24 16:09:45 2016

{
"cmdline": ["/usr/share/filebeat/bin/filebeat","-c","/etc/filebeat/filebeat.yml","-path.home","/usr/share/filebeat","-path.config","/etc/filebeat","-path.data","/var/lib/filebeat","-path.logs","/var/log/filebeat","-httpprof","127.0.0.1:6060"],
"filebeat.harvester.closed": 0,
"filebeat.harvester.files.truncated": 0,
"filebeat.harvester.open_files": 0,
"filebeat.harvester.running": 0,
"filebeat.harvester.skipped": 0,
"filebeat.harvester.started": 0,
"filebeat.prospector.log.files.renamed": 0,
"filebeat.prospector.log.files.truncated": 0,
"libbeat.es.call_count.PublishEvents": 0,
"libbeat.es.publish.read_bytes": 0,
"libbeat.es.publish.read_errors": 0,
"libbeat.es.publish.write_bytes": 0,
"libbeat.es.publish.write_errors": 0,
"libbeat.es.published_and_acked_events": 0,
"libbeat.es.published_but_not_acked_events": 0,
"libbeat.kafka.call_count.PublishEvents": 0,
"libbeat.kafka.published_and_acked_events": 0,
"libbeat.kafka.published_but_not_acked_events": 0,
"libbeat.logstash.call_count.PublishEvents": 0,
"libbeat.logstash.publish.read_bytes": 0,
"libbeat.logstash.publish.read_errors": 0,
"libbeat.logstash.publish.write_bytes": 0,
"libbeat.logstash.publish.write_errors": 0,
"libbeat.logstash.published_and_acked_events": 0,
"libbeat.logstash.published_but_not_acked_events": 0,
"libbeat.outputs.messages_dropped": 0,
"libbeat.publisher.messages_in_worker_queues": 0,
"libbeat.publisher.published_events": 0,
"libbeat.redis.publish.read_bytes": 0,
"libbeat.redis.publish.read_errors": 0,
"libbeat.redis.publish.write_bytes": 0,
"libbeat.redis.publish.write_errors": 0,
"publish.events": 452,
"registar.states.current": 5181,
"registrar.states.cleanup": 0,
"registrar.states.update": 452,
"registrar.writes": 66
}

HOWEVER after all that time

nodes started up properly and are delivering events from the proper point (stored before restarted).
it also looks like registry is updated properly now.
i was able to change tail_files: true -> false without any negative impact

But really, I had to wait almost 15 minutes for FB to become operational, where it was a matter of seconds in previous versions.

ruflin · November 25, 2016, 9:24am

We currently have the suspicion that the above is related to publish_async and the large number of logstash instances you have listed in your config. It would be interesting to hear if you see the same behaviour if you just use 1 LS instance.

I would expect in the above that published_but_not_acked_events is > 0. I will check here if perhaps something with the counting goes wrong.

Could you share your full log files from startup until you finish filebeat?

@steffens Adding you to the conversation

shog · November 28, 2016, 4:15pm

I can confirm that disabling publish_async helped and I'm not longer seeing:

increased memory usage,
any problems with registry updates (both while running and after restarts)

Smallest setup I've tested so far had only 3 LS nodes and it was affected by this issue too. I'll test with one.
(for now) surprisingly I've spotted no hard evidence on published_but_not_acked_events being a culprit; I had nodes with 0 non-acked events that were still affected by registry problems and I'm also having a lot of nodes that are showing non-acked events with publish_async disabled that behave just fine.

steffens · November 28, 2016, 11:07pm

publish_async + load balancing might not play well together. Problem with async + load balancing is, batches can become transferred out of order. If one batch gets 'stuck' between retries but others can still be send, all batches might pile up in memory. Normally internal queues are bounded, so beats should stop at some point, but some combination of settings + speed differences + spurious network issues might trigger some issues, so internal queue becomes unbounded. I have some potential fix in mind, but still want to understand what exactly is going on. I wonder if problem is reproducible with publish_async and exactly one LS.

published_but_not_acked_events indicates some events could not be pulibhsed (not ACKed by LS). That is, these events must be resend (re-enqueued and load balanced one more).

The thing about the registry is, the registry requires in order updates. That is all batches published (ACKed or not) need to be cached in memory in order to guarantee registry updates being in order. If one batch gets 'stuck' for whatever reason (send to a very slow logstash instance), batches can pile up. The upper bound by bounded send-queues might not trigger due to load-balancing.

I guess issues could be triggered by load-balancing already if instances can handled batches at different speeds. The published_but_not_acked_events is often a sign LS did not respond in time and the publish request either timed out or logstash did close the connection (logstash being overloaded). That is published_but_not_acked_events can be a good indicator for some LS instances either being overloaded or having to deal with back-pressure themselves.

With this many LS instances, have you checked the LS outputs can actually deal with the load by this many LS instances?

Can you post your filebeat and LS configs? I've no idea about your settings.

system · December 26, 2016, 11:07pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Filebeat beta1 resends random data upon every restart (registry file not updated properly?) Beats filebeat	14	2246	October 20, 2016
Filebeat stops updating the registry Beats filebeat	8	2856	January 2, 2017
It doesn't seem to be doing anything Beats filebeat	2	980	June 23, 2016
Ignore registry after a reboot Beats filebeat	3	603	November 20, 2018
FIlebeat doesn't reparse logs Beats filebeat	11	2856	July 5, 2017

[filebeat] no registry updates and high memory usage after changing tail_files from true to false

Related topics