Logstash OOM - understanding heap sizing

Hello

Could you help me understand logstash's memory requirements? I am currently getting these errors:

[2017-10-11T09:45:02,586][INFO ][org.logstash.beats.BeatsHandler] Exception: java.lang.OutOfMemoryError: Java heap space, from: /<ip>:46818

Logstash is configured with a ~4GB heap. In jvm.options:

-Xms3909m
-Xmx3909m

The GC is obviously running all the time, with 1 core of a multi-core system being more or less constantly at 100%. Could you help me understand how I should size the heap for logstash?

Thanks,
Dan

Update: I believe this is related to having a large number of events in the persistent queue. I currently have over 130 million events in the queue, with a queue size of 200GB. The OOM errors happen directly after logstash startup, and the number of events in the queue is (slowly) going down.
I am not a Java developer; could it be that such a large queue is somehow not memory-friendly?


I increased the heap to ~4GB and now the OOM errors are gone, but GC is still running all the time and performance has slowed to a crawl:

247.157: [Full GC (Allocation Failure) 247.157: [CMS249.912: [CMS-concurrent-mark: 4.669/4.704 secs] [Times: user=26.71 sys=1.33, real=4.70 secs]
 (concurrent mode failure): 3494335K->3494335K(3494336K), 11.7674103 secs] 4874495K->3990390K(4874496K), [Metaspace: 60628K->60628K(1103872K)], 11.7676024 secs] [Times: user=18.68 sys=0.82, real=11.76 secs]

Which version are you running? What does your config look like?

I'm running 5.6.3.

The config has several input plugins (beats, gelf, tcp, udp), a number of filters, mostly grok, and a single output to elasticsearch. The same config has been running without issues for months, and is currently running on another 5.6.3 instance with no issues. I'd rather not post it here without sanitizing it first; let me know if you need to see the details and I'll work on it.

If you have that much data in your persistent queue it sounds like either Elasticsearch or Logstash is not able to keep up with the load. Have you identified what the bottleneck is?

I am, however, not sure how the persistent queue uses memory, so I will leave that for someone else.

The bottleneck is definitely Logstash. It's using up all the heap while processing the messages in the queue. I'm not sure exactly why this happens, but surely the size of the queue should not require a large heap to process. Logstash should be processing the queue at its own pace, so memory usage should not depend on the rate of events coming into the queue either. I'm beginning to think this might be a bug.

Logstash will only work as fast as the slowest output allows, and if it is not able to keep up, Logstash will start filling up the persistent queue. How did you determine that Logstash is actually the bottleneck?

This is not an output problem. Logstash is running GC all the time, which is slowing it down. This can be seen in the GC logs posted above.

I also ran a test by stopping Logstash, renaming the queue directory, and re-starting. (Basically restarting Logstash with an empty queue.) Logstash runs normally with no events in the queue. If the logs above do not convince you, then this screenshot hopefully will:

[screenshot: heap usage and events-per-second graphs before and after clearing the queue]

Update: I want to clarify a little why I believe Logstash is the bottleneck. It is running on a 24-core machine. Before clearing the queue, the CPU utilization is 100% on one core (by the logstash java process), and 0% on all the others. This is consistent with high memory pressure and continuous garbage collection. The continuous garbage collection is corroborated by the high heap usage as seen in the graphs, reported by Logstash itself, and the long garbage collection duration as seen in the GC logs posted above. The single-threaded performance of the GC becomes a bottleneck, and the rate of emitted events plummets.

After clearing the queue, the CPU utilization is normal (I'd say ~50% on all cores in parallel) for a multi-threaded application. The heap usage shows a normal multi-threaded sawtooth pattern. The improved situation is also reflected in the increased number of events processed per second.

I agree that the current situation with high heap usage and GC, potentially due to the size of the persistent queue, is a Logstash problem. However, I do not know why it behaves that way.

Before this occurred you had, however, managed to build up a lot of data in the persistent queue, and this suggests to me that Elasticsearch may not be able to keep up with your throughput, especially if Logstash has access to that many cores and under normal conditions does not saturate them. If Elasticsearch (or possibly the Logstash processing pipeline, which consumes data from the persistent queue) were not the bottleneck, the persistent queue should not continually grow.

The persistent queue uses memory-mapped files, and these are read into off-heap memory. Such a big queue will have thousands of page and checkpoint files. Only a few of these are in memory at a time: the 'head' page, where writes go, and a few tail pages, where reads and acks are done.
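To get a feel for how many page files a queue of that size spans, here is a rough sketch in Python; the 250 MB page capacity is an assumption about the queue.page_capacity setting, so check your own logstash.yml:

# Rough count of persistent-queue page files for a 200 GB queue.
# The page capacity below is an assumption (queue.page_capacity); verify it
# against your own logstash.yml.
queue_bytes = 200 * 1024**3        # ~200 GB of queued data on disk
page_capacity = 250 * 1024**2      # assumed 250 MB per page file

pages = queue_bytes // page_capacity
print(f"~{pages} page files (plus their checkpoint files); only the head page "
      f"and a few tail pages are mapped into off-heap memory at any time")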

There is also the number of 'events in flight' to consider. A batch of 125 events is read by each worker thread; with, say, 24 workers, that is 125 x 24 = 3000 events in LS at any one time. When a batch of events is JSON-serialized for the bulk index request to Elasticsearch, the memory for that batch is doubled. As the Elasticsearch output is multithreaded, each worker's batch can be serialized into memory at the same time.

To calculate how much memory to give to LS, one needs to know the average and maximum size of your events.
For a 1 MB average event size, 3000 events can mean 6000 MB of memory consumption (3000 MB for the events plus another 3000 MB for the serialized copies). Each batch is a mixed bag of events, though.

However, from your reported queue size of 200GB @ 130 million events we get 2e11 / 130e6 = 1538 bytes per event, or ~1.5KB, and at that size 3000 in-flight events would consume only about 9MB of heap (including the doubling for serialization).
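A minimal sketch of that back-of-envelope estimate, using the batch size, worker count, and serialization doubling assumed above (substitute your own values):

# Rough estimate of heap consumed by in-flight events, per the reasoning above.
def in_flight_heap_bytes(batch_size, workers, avg_event_bytes, serialization_factor=2):
    # Events held by the worker threads, doubled while each batch is
    # JSON-serialized for the Elasticsearch bulk request.
    return batch_size * workers * avg_event_bytes * serialization_factor

avg_event = 200e9 / 130e6                                 # ~1538 bytes per queued event
print(in_flight_heap_bytes(125, 24, avg_event) / 1e6)     # ~9 MB
print(in_flight_heap_bytes(125, 24, 1e6) / 1e6)           # ~6000 MB if events averaged 1 MB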

Have you changed the batch size? Do you know whether you have any extremely large events? Say 100MB+.

@dandrestor
In addition to what Christian said, did your Elasticsearch cluster go offline at some point for a long time? Or, in other words, can you explain forensically how the queue grew to that size? For example, was it fairly small for some days and then suddenly increased in size, or has there been a steady increase over the course of X days?

A steadily growing queue obviously means that the incoming rate of events is consistently higher than the outgoing rate. The solution is to increase the outgoing rate, either by using Kafka and multiple LS instances, or by giving more resources to the Elasticsearch cluster and tuning it, or both.

When it grows during business hours and shrinks overnight this means that the queue is doing its job of queuing.

We know you have the beats input, but how many upstream machines are the Beats collecting events from?
What filters are you using?
Have you done any tuning of your Elasticsearch cluster?
Do you have daily logstash indices, and if so, how many documents are in each daily index? e.g. 864000 docs means an indexing rate of 10 docs per second (presuming that LS will be pumping docs at ES 24/7 at a constant-ish rate, because the queue has a 130 million event backlog).

Do you have a staging/test machine with the same specs as the LS one? Could you take a copy of the queue from the main LS (stop it first) and paste it into the data dir of the LS on the second machine but change the output to stdout {}? How long does your config take to drain the 130 million events?

Guy, thanks for responding and for the detailed write-up.

These numbers should be accurate. 130 million is the number of events as reported by the Logstash monitoring API, and 200GB is the size of the queue directory as reported by the OS. The average event size is in the right neighbourhood, with the largest events being Windows Security Event Log events (which can be a little verbose, but not larger than 1-2k).

I have not changed the batch size; it should be running with the default. No events should be larger than the Windows events mentioned above.

The queue grew as a result of Logstash being offline over the weekend due to Persistent Queues fails to deserialize after writing extremely large numeric values · Issue #8379 · elastic/logstash · GitHub
This bug caused only the filter/output workers of the Logstash instance to crash, while the inputs continued to run, so events kept accumulating in the queue. This morning I upgraded to 5.6.3, which contains the fix, but discovered that Logstash can't process the large accumulated queue.

Under normal conditions, this particular Logstash machine and the associated ES output are fast enough to meet the demand, even during business hours. I usually have zero queued events.

A few (fewer than 15) Beats agents, Filebeat and Metricbeat.

Mainly grok filters, but on a minority of events.

Nothing special, apart from the recommended settings like heap size, mlockall, etc.

Yes, but indexes are also separated by originating system, so I have multiple indexes per day. I have around 100 million events per day across all indices. Of these, I would estimate that about 60-70% come through this particular Logstash instance that is exhibiting the problem.

I do have such a staging machine, I will perform the test tomorrow and get back to you with the results.

I haven't been able to run exactly the test you asked for, but I ran something equivalent. Let me know if this data is enough.

This was run on the exact same machine. I stopped Logstash, replaced the output configuration with stdout {}, replaced the queue directory with the full one, and restarted. I didn't leave it running until the queue was drained, but instead made a note of the event rates. I observed the exact same behaviour (high heap, GC CPU-bound). This is what it looked like:

In the picture, the first and third graph segments are with an empty queue, output to ES. The test I made is the middle segment.

To summarize:

  • Empty queue, output to ES => LS able to keep up with input event rate of ~4000eps.
  • Full queue, output to ES => GC CPU-bound, LS able to drain queue at less than 500eps.
  • Full queue, output to stdout => GC CPU-bound, LS able to drain queue at less than 500eps.

And to answer your question, assuming an average of 500eps (which is high), it would take three days to drain the queue.
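For reference, a quick sketch of that drain-time arithmetic, using the ~500 eps drain rate observed in the test above:

# Back-of-envelope drain time for the queued backlog at the observed rate.
backlog_events = 130e6      # events reported in the persistent queue
drain_rate_eps = 500        # observed events/second while GC-bound
print(f"{backlog_events / drain_rate_eps / 86400:.1f} days")   # ~3 days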
