[filebeat] no registry updates and high memory usage after changing tail_files from true to false


#1

Unfortunately I've noticed yet another problem with registry file updates in my setup:

Filebeat Version: 5.0.1

Steps to repro:

  1. Start filebeat with tail_files: true to get initial state of files.
tail_files: true

At this point everything works fine.

  1. After a longer moment stop Filebeat, change tail_files back to false to continue from that point.
tail_files: false

# ls -l /var/lib/filebeat/registry 
-rw-r--r-- 1 root root 504152 Nov 22 18:54 /var/lib/filebeat/registry

Observed problem: Filebeat continues delivering messages, but no registry update is ever made:

# curl -s http://localhost:6060/debug/vars | egrep "logstash|registr|harvester|prospector"
"filebeat.harvester.closed": 263,
"filebeat.harvester.files.truncated": 0,
"filebeat.harvester.open_files": 387,
"filebeat.harvester.running": 387,
"filebeat.harvester.skipped": 0,
"filebeat.harvester.started": 650,
"filebeat.prospector.log.files.renamed": 0,
"filebeat.prospector.log.files.truncated": 0,
"libbeat.logstash.call_count.PublishEvents": 432,
"libbeat.logstash.publish.read_bytes": 3408,
"libbeat.logstash.publish.read_errors": 0,
"libbeat.logstash.publish.write_bytes": 440049709,
"libbeat.logstash.publish.write_errors": 0,
"libbeat.logstash.published_and_acked_events": 719514,
"libbeat.logstash.published_but_not_acked_events": 0,
"registrar.states.cleanup": 0,
"registrar.states.update": 0,
"registrar.writes": 0
# date
Tue Nov 22 19:16:47 UTC 2016
# ls -l /var/lib/filebeat/registry 
-rw-r--r-- 1 root root 504152 Nov 22 18:54 /var/lib/filebeat/registry

END result: any subsequantial restart will trigger resending data collected from the point where registry files was updated last time (before restart in point 2))

Note: If I leave tail_false as true, I might restart filebeat without any problem and registry file WILL be updated as expected.


Filebeat stops updating the registry
#2

Additionally there is insane memory usage on boxes affected by lack of registry updates and filebeat process can grow up to several dozens of gigabytes of RES memory (until it's gonna be killed by OOM).

 1125 root      15  -5 1371m 1.3g 5048 S  5.6  2.8   8:47.67 filebeat                                                                                                                                                                                                

(ruflin) #3

Quickly tried to reproduce this with 5.0.1 and master but no success. What is your OS? Any chance to share some log files?

  • After starting filebeat with tail_files: true, did you add events and check if the registry had the correct entries?
  • Did you add events during the time filebeat was down?

Trying to find the exact steps you did to reproduce it.


#4

hI @ruflin

UPDATE: So it looks like that change of tail_files from true to false isn't a hard requirement to trigger this, I already see some nodes that were left with "true" and failed to update registry after restart too. It's just happens less often on them.

  • There is nothing meaningful in FB logs,
  • It's CentOS6
  • All affected FB instances are still delivering events properly, despite growing memory consumption (they gonna be eventually killed by OOM). My biggest filebeat instance at the moment is at 22gb of mem used:
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7259 root      15  -5 23.4g  22g 2548 S  4.6 36.4 195:38.01 filebeat         

Memstats are as follows: https://paste.fedoraproject.org/488846/47991177/

  • I've rechecked and for sure there was no second FB instance spawned at any point (to avoid situation described in https://github.com/elastic/beats/issues/2951 )
  • It's probably also not a file ulimit problem (bumped to 50k and according to filebeat stats there are usually 0.5k of files open)

Yes, I'm running Filebeat on syslog receiver boxes so logs are constantly updated (with some amount of newly created file every once a while), including period when FB is down. It looks like registry file wasn't modified with any update after first restart at all.


#5

hi @ruflin

One more UPDATE: Apparently even restart isn't required to trigger this bug. I already had more than occurrence of Filebeat instances that suddenly stopped updating registry file without any restart or second instance being spawned.

So the example timeline was as follows:
18:00 Filebeat started
00:00 Registry updates stopped,Filebeat began to consume more memory from that point.
Apart from lack of registry updates, nodes remained operational and were able to deliver events to logstash.
00:00 was also a time when some amount of files were created, but for sure it didn't hit open file limit.

So I'm wondering, is it possible that in some cases all of the registry updates are accumulated in memory instead of being written to disc? That'd explain both issues.


(ruflin) #6

Thanks for the investigation. Could you share your full config?


(ruflin) #7

Can you disable publish_async and check if you still see the issue?


#8

So that's actually very interesting.

For ~10 minutes after restarting Filebeat with publish_async: false there are zero events delivered, but Filebeat's constantly changing registry file. Plus time needed to fully stop filebeat (as to wait until filebeat process shuts down) is also incredibly long (3-5 minutes?). Worth to note that rc1 never behaved that way.

Every 2.0s: curl -s http://localhost:6060/debug/vars | grep -v memstats                                                                                                                                                                      Thu Nov 24 16:09:45 2016

{
"cmdline": ["/usr/share/filebeat/bin/filebeat","-c","/etc/filebeat/filebeat.yml","-path.home","/usr/share/filebeat","-path.config","/etc/filebeat","-path.data","/var/lib/filebeat","-path.logs","/var/log/filebeat","-httpprof","127.0.0.1:6060"],
"filebeat.harvester.closed": 0,
"filebeat.harvester.files.truncated": 0,
"filebeat.harvester.open_files": 0,
"filebeat.harvester.running": 0,
"filebeat.harvester.skipped": 0,
"filebeat.harvester.started": 0,
"filebeat.prospector.log.files.renamed": 0,
"filebeat.prospector.log.files.truncated": 0,
"libbeat.es.call_count.PublishEvents": 0,
"libbeat.es.publish.read_bytes": 0,
"libbeat.es.publish.read_errors": 0,
"libbeat.es.publish.write_bytes": 0,
"libbeat.es.publish.write_errors": 0,
"libbeat.es.published_and_acked_events": 0,
"libbeat.es.published_but_not_acked_events": 0,
"libbeat.kafka.call_count.PublishEvents": 0,
"libbeat.kafka.published_and_acked_events": 0,
"libbeat.kafka.published_but_not_acked_events": 0,
"libbeat.logstash.call_count.PublishEvents": 0,
"libbeat.logstash.publish.read_bytes": 0,
"libbeat.logstash.publish.read_errors": 0,
"libbeat.logstash.publish.write_bytes": 0,
"libbeat.logstash.publish.write_errors": 0,
"libbeat.logstash.published_and_acked_events": 0,
"libbeat.logstash.published_but_not_acked_events": 0,
"libbeat.outputs.messages_dropped": 0,
"libbeat.publisher.messages_in_worker_queues": 0,
"libbeat.publisher.published_events": 0,
"libbeat.redis.publish.read_bytes": 0,
"libbeat.redis.publish.read_errors": 0,
"libbeat.redis.publish.write_bytes": 0,
"libbeat.redis.publish.write_errors": 0,
"publish.events": 452,
"registar.states.current": 5181,
"registrar.states.cleanup": 0,
"registrar.states.update": 452,
"registrar.writes": 66
}

HOWEVER after all that time

  • nodes started up properly and are delivering events from the proper point (stored before restarted).
  • it also looks like registry is updated properly now.
  • i was able to change tail_files: true -> false without any negative impact

But really, I had to wait almost 15 minutes for FB to become operational, where it was a matter of seconds in previous versions.


(ruflin) #9

We currently have the suspicion that the above is related to publish_async and the large number of logstash instances you have listed in your config. It would be interesting to hear if you see the same behaviour if you just use 1 LS instance.

I would expect in the above that published_but_not_acked_events is > 0. I will check here if perhaps something with the counting goes wrong.

Could you share your full log files from startup until you finish filebeat?

@steffens Adding you to the conversation :wink:


#10

I can confirm that disabling publish_async helped and I'm not longer seeing:

  • increased memory usage,
  • any problems with registry updates (both while running and after restarts)
  • Smallest setup I've tested so far had only 3 LS nodes and it was affected by this issue too. I'll test with one.
  • (for now) surprisingly I've spotted no hard evidence on published_but_not_acked_events being a culprit; I had nodes with 0 non-acked events that were still affected by registry problems and I'm also having a lot of nodes that are showing non-acked events with publish_async disabled that behave just fine.

(Steffen Siering) #11

publish_async + load balancing might not play well together. Problem with async + load balancing is, batches can become transferred out of order. If one batch gets 'stuck' between retries but others can still be send, all batches might pile up in memory. Normally internal queues are bounded, so beats should stop at some point, but some combination of settings + speed differences + spurious network issues might trigger some issues, so internal queue becomes unbounded. I have some potential fix in mind, but still want to understand what exactly is going on. I wonder if problem is reproducible with publish_async and exactly one LS.

published_but_not_acked_events indicates some events could not be pulibhsed (not ACKed by LS). That is, these events must be resend (re-enqueued and load balanced one more).

The thing about the registry is, the registry requires in order updates. That is all batches published (ACKed or not) need to be cached in memory in order to guarantee registry updates being in order. If one batch gets 'stuck' for whatever reason (send to a very slow logstash instance), batches can pile up. The upper bound by bounded send-queues might not trigger due to load-balancing.

I guess issues could be triggered by load-balancing already if instances can handled batches at different speeds. The published_but_not_acked_events is often a sign LS did not respond in time and the publish request either timed out or logstash did close the connection (logstash being overloaded). That is published_but_not_acked_events can be a good indicator for some LS instances either being overloaded or having to deal with back-pressure themselves.

With this many LS instances, have you checked the LS outputs can actually deal with the load by this many LS instances?

Can you post your filebeat and LS configs? I've no idea about your settings.


(system) #12

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.