Topbeat does not drop events?


(Svenfa) #1

Hi all,

I'm using topbeat 1.0.0 on a few clients (Windows 2k12 and Ubuntu Server 14.04). It sends events load-balanced across two logstash 2.1 instances (which forward the data to ES 2.0).

After a few hours of running well, my logstash servers stop accepting input (a separate problem I'm currently dealing with), and the topbeat processes on all my clients start consuming more and more memory.

As soon as my logstash servers accept input again, they are flooded with all the topbeat events that couldn't be delivered.

This is great, since no data is lost, but very bad if it happens at night and my clients run out of memory.

I've read that topbeat will by default try to deliver events three times and then drop them (as mentioned [here](https://www.elastic.co/guide/en/beats/libbeat/current/configuration.html#_max_retries_2)). But it seems like topbeat keeps every single event until it can be delivered.

Is there another way to tell topbeat to drop events if they cannot be sent correctly?

Thanks in advance!


(Steffen Siering) #2

hi,

Events should be dropped after three retries. By default topbeat has a buffer for creating the batches it tries to forward, plus some internal queues for async processing. So some data may buffer up until logstash becomes available again (but some data should be dropped in the meantime).

In addition, as long as a subset of the events in a batch has been sent, the batch is retried until all events are sent. That is, if logstash is still available but very, very slow, there is a chance no data is 'lost'. Still, if logstash is much too slow or not responsive at all, the internal queues should eventually fill up, and topbeat should stop collecting new events until there is some space in the queues again.
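The "queues fill up, collection stops" behavior is the classic bounded-queue back-pressure pattern. A generic Go sketch (an illustration of the pattern, not topbeat's actual code): once the channel's buffer is full, further sends are rejected, which is the point where a collector would pause.

```go
package main

import "fmt"

func main() {
	// Bounded queue: capacity 3 stands in for topbeat's internal queues.
	queue := make(chan string, 3)

	// Non-blocking send: returns false when the queue is full,
	// which is where a collector would stop producing new events.
	trySend := func(ev string) bool {
		select {
		case queue <- ev:
			return true
		default:
			return false
		}
	}

	for i := 0; i < 5; i++ {
		ok := trySend(fmt.Sprintf("event-%d", i))
		fmt.Println(ok)
	}
	// The first 3 sends succeed; the last 2 are rejected because
	// no consumer is draining the queue.
}
```

A blocking send (`queue <- ev` without the `select`) would instead stall the producer until a consumer makes progress, which is the back-pressure variant described above.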

How does topbeat's memory usage develop over time? Have you checked the timestamps of the events (of some random process's memory stats, for example) to see whether they are roughly continuous and evenly spaced?

There are some knobs you can try to improve the behavior:
timeout, max_retries, bulk_max_size, worker

For example, set max_retries to 0 to disable retrying. If an attempt to send data fails, the data will be dropped.
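A sketch of how these knobs might be set in the logstash output section of topbeat.yml (host names are placeholders and the values are illustrative, not recommendations; check the configuration docs linked above for your version):

```yaml
output:
  logstash:
    hosts: ["logstash1:5044", "logstash2:5044"]
    loadbalance: true
    worker: 1          # output workers per configured host
    timeout: 30        # network timeout for a send attempt, in seconds
    max_retries: 0     # drop events right after the first failed attempt
    bulk_max_size: 2048
```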

I'll run some tests later today to check whether I can reproduce the issue.


(Steffen Siering) #3

Hi,

So I ran some experiments today. The load balancer is indeed behaving correctly; the memory usage grows due to internal queueing and batching. Memory usage grows until all queues become saturated, and the most important queue (for generating back-pressure in topbeat) is the output worker queue, as it receives only batched-up data. The default size N of this queue is 1000. Given the default bulk size of 10000 for logstash, the batch size B for topbeat will in practice be determined by the number of processes running locally (right now 149 processes on my Mac). With one batch being prepared and one being processed by the output workers, plus additional queues (which in topbeat hold single events only), a total of N*(3+B) events (~150K elements on my computer) can be queued up in memory. I have no idea how big topbeat events are in memory on average, but I killed topbeat when it reached about 70MB (this high memory usage is intolerable for a process as small as topbeat).
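Plugging the numbers from the paragraph above into the N*(3+B) estimate gives the same ballpark figure:

```go
package main

import "fmt"

func main() {
	// Back-of-the-envelope check using the numbers above.
	const n = 1000 // default output worker queue size
	const b = 149  // batch size ~ number of local processes in my test
	total := n * (3 + b)
	fmt.Println(total) // 152000, i.e. the ~150K queued events mentioned above
}
```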

Unfortunately the internal queue sizes are not configurable yet, but for testing I set the size of the last queue to 5 (to give an idea of what's happening). With a queue size of 5, the maximum memory usage on my Mac was ~6.6MB.

bulk.go:113: INFO bulk forward 10 events
balance.go:261: INFO Forward failed with attempts left: 4
bulk.go:113: INFO bulk forward 10 events
bulk.go:113: INFO bulk forward 10 events
bulk.go:113: INFO bulk forward 10 events
bulk.go:113: INFO bulk forward 10 events
bulk.go:113: INFO bulk forward 10 events
bulk.go:113: INFO bulk forward 10 events
balance.go:261: INFO Forward failed with attempts left: 3
balance.go:261: INFO Forward failed with attempts left: 2
balance.go:261: INFO Forward failed with attempts left: 1
balance.go:271: INFO No more attempts left
balance.go:121: INFO Max send attempts exhausted. Dropping 10 events
balance.go:261: INFO Forward failed with attempts left: 4
bulk.go:113: INFO bulk forward 3 events
balance.go:261: INFO Forward failed with attempts left: 3
balance.go:261: INFO Forward failed with attempts left: 2
balance.go:261: INFO Forward failed with attempts left: 1
balance.go:271: INFO No more attempts left
balance.go:121: INFO Max send attempts exhausted. Dropping 10 events
balance.go:261: INFO Forward failed with attempts left: 4
bulk.go:113: INFO bulk forward 8 events

In the test log you can see the queue filling up first. After a total of 4 send attempts (the first send attempt + max_retries=3) the events are dropped. When events are dropped, the next batch is read from the queue and forwarded (messages 'bulk forward 3 events' and 'bulk forward 8 events').

The configuration options timeout, max_retries and worker have no effect on memory usage, because topbeat tries to refill the queues the moment some progress (dropping events) is made. Right now you can try reducing bulk_max_size to lower memory usage (it can still be quite high).
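For example, a smaller bulk_max_size in the logstash output section of topbeat.yml (host names are placeholders and the value is purely illustrative; tune it for your environment):

```yaml
output:
  logstash:
    hosts: ["logstash1:5044", "logstash2:5044"]
    bulk_max_size: 512   # smaller batches mean fewer events queued in memory
```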

I opened a related GitHub issue to make the queue sizes configurable.

Thanks for reporting this.


(system) #4