Logstash is configured with a ~4GB heap. In jvm.options:
-Xms3909m
-Xmx3909m
The GC is obviously running all the time, with one core of a multi-core system more or less constantly at 100%. Could you help me understand how I should size the heap for Logstash?
Thanks,
Dan
Update: I believe this is related to having a large number of events in the persistent queue. I currently have over 130 million events in the queue, with a queue size of 200GB. The OOM errors happen directly after Logstash startup, and the number of events in the queue is (slowly) going down.
I am not a Java developer; could it be that such a large queue is somehow not memory-friendly?
The config has several input plugins (beats, gelf, tcp, udp), a number of filters, mostly grok, and a single output to elasticsearch. The same config has been running without issues for months, and is currently running on another 5.6.3 instance with no issues. I'd rather not post it here without sanitizing it first; let me know if you need to see the details and I'll work on it.
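In the meantime, here is a rough, sanitized skeleton of its shape (all ports, patterns and hosts below are placeholders, not the real values):

input {
  beats { port => 5044 }    # placeholder port
  gelf  { port => 12201 }
  tcp   { port => 5000 }
  udp   { port => 5001 }
}
filter {
  grok {
    # the real patterns are more specific; this only shows the shape
    match => { "message" => "%{GREEDYDATA:msg}" }
  }
}
output {
  elasticsearch { hosts => ["http://es-host:9200"] }    # placeholder host
}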
If you have that much data in your persistent queue, it sounds like either Elasticsearch or Logstash is not able to keep up with the load. Have you identified what the bottleneck is?
I am however not sure how the persistent queue uses memory, so I will leave that for someone else.
The bottleneck is definitely Logstash. It's using up all the heap while processing the messages in the queue. I'm not sure exactly why this happens, but surely the size of the queue should not require a large heap to process. Logstash should be processing the queue at its own pace, so memory usage should not depend on the rate of events coming into the queue either. I'm beginning to think this might be a bug.
Logstash will only work as fast as the slowest output allows, and if that output is not able to keep up, Logstash will start filling up the persistent queue. How did you determine that Logstash is actually the bottleneck?
This is not an output problem. Logstash is running GC all the time, which is slowing it down. This can be seen in the GC logs posted above.
I also ran a test by stopping Logstash, renaming the queue directory, and re-starting. (Basically restarting Logstash with an empty queue.) Logstash runs normally with no events in the queue. If the logs above do not convince you, then this screenshot hopefully will:
Update: I want to clarify a little why I believe Logstash is the bottleneck. It is running on a 24-core machine. Before clearing the queue, the CPU utilization is 100% on one core (by the logstash java process), and 0% on all the others. This is consistent with high memory pressure and continuous garbage collection. The continuous garbage collection is corroborated by the high heap usage as seen in the graphs, reported by Logstash itself, and the long garbage collection duration as seen in the GC logs posted above. The single-threaded performance of the GC becomes a bottleneck, and the rate of emitted events plummets.
After clearing the queue, the CPU utilization is normal (I'd say ~50% on all cores in parallel) for a multi-threaded application. The heap usage shows a normal multi-threaded sawtooth pattern. The improved situation is also reflected in the increased number of events processed per second.
I agree that the current situation, with high heap usage and GC potentially due to the size of the persistent queue, is a Logstash problem. I do not, however, know why it behaves that way.
Before this occurred you had, however, managed to build up a lot of data in the persistent queue, and this suggests to me that Elasticsearch may not be able to keep up with your throughput, especially if Logstash has access to that many cores and under normal conditions does not saturate them. If Elasticsearch (or possibly the Logstash processing pipeline, which consumes data from the persistent queue) were not the bottleneck, the persistent queue should not continually grow.
The persistent queue uses memory-mapped files, and these are read into off-heap memory. Such a big queue will have thousands of page and checkpoint files. Only a few of these are in memory at a time: the 'head' page, where writes go, and a few tail pages, where the reads and acks are done.
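For reference, the page-file behaviour is governed by the queue settings in logstash.yml. The sizes below are purely illustrative (not necessarily the defaults for your version), but they show where the knobs are:

# logstash.yml -- persistent queue settings (sizes are illustrative)
queue.type: persisted
queue.max_bytes: 200gb        # upper bound on the queue size on disk
queue.page_capacity: 64mb     # size of each memory-mapped page file

With a 64mb page capacity, a 200GB queue works out to roughly 3000 page files, which matches the 'thousands of files' described above.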
There is also the number of 'events in flight' to consider. A batch of 125 events is read by each worker thread; with, say, 24 workers, that is 125 x 24 = 3000 events in LS at any one time. When a batch of events is JSON-serialized for the bulk index request to Elasticsearch, the memory for that batch is doubled, and as the Elasticsearch output is multithreaded, each worker's batch can be serialised into memory at the same time.
To calculate how much memory to give to LS, one needs to know what the average and max size of your events are.
For a 1 MB average event size, 3000 events can mean 6000MB of memory consumption (3000MB for the events themselves plus another 3000MB for their serialized copies). Each batch is a mixed bag of events, though.
However, from your reported queue size of 200GB @ 130 million events, we get 2e11 / 130e6 = 1538 bytes per event, or roughly 1.5KB, and at that size 3000 in-flight events would consume only about 9MB of heap.
Have you changed the batch size? Do you know whether you have any extremely large events? Say 100MB+.
@dandrestor
In addition to what Christian said, did your Elasticsearch cluster go offline at some point, for a long time? In other words, can you explain forensically how the queue grew to that size? For example, was it fairly small for some days and then suddenly increased in size, or has there been a steady increase over the course of X days?
A steadily growing queue obviously means that the incoming rate of events is consistently higher than the outgoing rate. The solution to this is to increase the outgoing rate, either by using Kafka and multiple LS instances, by giving more resources to the Elasticsearch cluster and tuning it, or both.
When it grows during business hours and shrinks overnight, this means that the queue is doing its job of queuing.
We know you have the beats input, but how many upstream machines is Beats collecting events from?
What filters are you using?
Have you done any tuning of your Elasticsearch cluster?
Do you have daily Logstash indices, and if so, how many documents are in each daily index? E.g. 864,000 docs means an indexing rate of 10 docs per second (presuming that LS will be pumping docs at ES 24/7 at a constant-ish rate, because the queue has a backlog of 130 million events).
Do you have a staging/test machine with the same specs as the LS one? Could you take a copy of the queue from the main LS (stop it first), copy it into the data dir of the LS on the second machine, and change the output to stdout {}? How long does your config take to drain the 130 million events?
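The output change for that test would be something along these lines (the dots codec is optional; it just prints one character per event instead of the full document):

# replace the elasticsearch output with a sink for the drain test
output {
  stdout { codec => dots }
}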
Guy, thanks for responding and the detailed write-up.
These numbers should be accurate. 130 million is the number of events as reported by the Logstash monitoring API, and 200GB is the size of the queue directory as queried by the OS. The average event size is in the right neighbourhood, with the largest events being Windows Security Event Log entries (which can be a little verbose, but not larger than 1-2k).
I have not changed the batch size, it should be running with the default. No events should be larger than the Windows events mentioned above.
Under normal conditions, this particular Logstash machine and the associated ES output are fast enough to meet the demand, even during business hours. I usually have zero queued events.
A few (fewer than 15) Beats agents, running Filebeat and Metricbeat.
Mainly grok filters, but on a minority of events.
Nothing special, apart from the recommended settings like heap size, mlockall, etc.
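That is, roughly the following, assuming a 5.x cluster (the heap value is just an example here, not my actual setting):

# elasticsearch.yml -- memory lock (mlockall) as recommended
bootstrap.memory_lock: true

# jvm.options on the ES nodes (example heap size)
-Xms16g
-Xmx16g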
Yes, but indexes are also separated by originating system, so I have multiple indexes per day. Across all indices, I have around 100 million events for any given day. Of these, I would estimate that about 60-70% come through this particular Logstash instance that is exhibiting the problem.
I do have such a staging machine, I will perform the test tomorrow and get back to you with the results.
I haven't been able to run the exact test you asked for, but I ran something equivalent. Let me know if this data is enough.
This was run on the exact same machine. I stopped Logstash, replaced the output configuration with stdout {}, replaced the queue directory with the full one, and re-started. I didn't leave it running until the queue was drained, but instead made a note of the event rates. I observed the exact same behaviour (high heap usage, GC CPU-bound). This is what it looked like: