Large XML crashes logstash with OOM


(Satish) #1

Hi,

I'm working on a setup where large XML documents are part of the logs. These documents can be 10 to 20 MB at times. With files this large, Logstash intermittently crashes with an OutOfMemory error or with "entity expansion grown too large" errors. Has anyone come across this situation and found a way out? I would really appreciate your help.

Cheers,
Satish


(Satish) #2

@PhaedrusTheGreek It can be reproduced using the configuration file below:

input
{
        file
        {
                path => "/tmp/xml.log"
        }
}

filter
{
        multiline
        {
                pattern => "^<ns0:"
                negate => true
                what => previous
        }
        xml
        {
                source => [ "message" ]
                target => [ "x" ]
        }
}

output
{
        stdout
        {
                codec => rubydebug
        }
}

The input XML gist is here: https://gist.github.com/asatsi/330e5c23830752d53bee

FYI, I am running logstash 1.5.3.


(Magnus Bäck) #3

Logstash's default JVM heap is 500 MB and I think that should be enough for parsing a 20 MB XML file. Have you tried increasing the heap size? Depending on how you start Logstash you can do that via /etc/default/logstash, /etc/sysconfig/logstash, or by setting the LS_HEAP_SIZE environment variable. Try "1024m" for starters.
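
For a manual start, setting the environment variable could look roughly like the sketch below. It assumes Logstash 1.5.x is started from an unpacked tarball in the current directory and that the configuration above is saved as xml.conf (a hypothetical file name), so the launch line is left as a comment. For package installs, the equivalent would be a LS_HEAP_SIZE="1024m" line in /etc/default/logstash or /etc/sysconfig/logstash.

```shell
# Raise the JVM heap that the Logstash start script requests (default is 500 MB).
export LS_HEAP_SIZE="1024m"

# Confirm the value that the next Logstash start will pick up.
echo "LS_HEAP_SIZE=${LS_HEAP_SIZE}"

# Then start Logstash from the same shell. The path and the config file name
# are assumptions for illustration:
#   bin/logstash -f xml.conf
```

If the crashes continue at 1024m, the same variable can be raised further in steps.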


(Satish) #4

Increasing the heap size to 4096m helped avoid the crashes for now. Thanks!


(Jay Greenberg) #5

@asatsi, I am able to reproduce the problem using your configuration and the provided XML file. Because each XML document is 20 MB, this is expected. The XML DOM parsing library expands each XML document into a much larger object in memory, so the only workaround is to do as you did and increase the Java heap size.

Also, please use caution if you intend to aggressively index documents of this size into Elasticsearch. Search and aggregation should perform well, but indexing, retrieving, and merging will be disk intensive.

