Logstash throws runtime exception while parsing huge XML

Hello,

In Logstash I am trying to strip a prefix from the XML that has no namespace definition, and then index the document into Elasticsearch. I receive an error while parsing some huge XMLs. I need the full XML to be stored in Elasticsearch for analytics. Please find below the error snippet and the Logstash configuration file.

Error -

exception=>#<RuntimeError: entity expansion has grown too large>, :backtrace=>["uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/text.rb:399:in `block in unnormalize'", "org/jruby/RubyString.java:3056:in `gsub'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/text.rb:396:in `unnormalize'"

input {
  beats {
    port => "61000"
    client_inactivity_timeout => 3600
  }
}
filter {
  mutate {
    gsub => [
      "message", "<L:", "<",
      "message", "</L:", "</",
      "message", "&lt;L:", "&lt;",
      "message", "&lt;/L:", "&lt;/"
    ]
  }
  xml {
    source => "message"
    store_xml => true
    xpath => ["//RECORD/Name/text()","capability","//RECORD/DATE/text()","message_date","//RECORD/TIME/text()","message_time"]
    target => "xml_message"
  }

  grok {
    match => ["message_date", "(?<month>20.{5})"]
  }

  mutate {
    lowercase => [ "capability" ]
    add_field => {
      "message_dateTime" => "%{message_date}T%{message_time}"
    }
    remove_field => ["message_date","message_time","host"]
  }
}
output {
  elasticsearch {
    hosts => ["XXXXXX"]
    index => "%{capability}_%{+YYYY_ww}"
  }
}

Kindly let me know whether any parameter can be increased so that huge XMLs parse, or whether there is a workaround for this.

Cheers,
Maadavan

The error is occurring in the rexml library, which is used by the XmlSimple library that the logstash xml filter uses. The default limit on the size of expanded entity text is 10 KB (the backtrace shows the exception being raised while rexml expands entity references in unnormalize). Note that, as far as I can see, the problem is not the size of the XML document, it is the size of the entity text within that document.

There is no way to pass rexml configuration options to the xml filter.

The limit is a class variable. So let me say that it would be a terrible, terrible idea to use a ruby filter to set it before calling the xml filter. Do not do it.

If you do not actually need to store the entire parsed document, then set store_xml to false and use only the xpath option: in that case the XML is parsed using nokogiri instead of XmlSimple. That may not have the same limits.
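As a rough sketch of that suggestion, the xml filter from the question could be reduced to xpath extraction only. This assumes that with store_xml set to false the filter skips the XmlSimple (and therefore rexml) code path, and that keeping the raw XML as text in the message field is acceptable. The field names and XPath expressions are taken from the original configuration.

```
filter {
  xml {
    source => "message"
    # Assumption: with store_xml disabled, XmlSimple/rexml is not invoked.
    store_xml => false
    # Only these values are extracted from the document.
    xpath => [
      "//RECORD/Name/text()", "capability",
      "//RECORD/DATE/text()", "message_date",
      "//RECORD/TIME/text()", "message_time"
    ]
  }
}
```

With this variant the document is not broken out into per-element fields under xml_message; only the three extracted fields plus the original message text reach the output.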

Hello @Badger,

Thanks for your swift reply. A few queries, please.

  1. Is the rexml library used in the xml filter only because a ruby filter (in this case mutate gsub) runs before the xml filter?

  2. If the answer to the above is yes: I am using it to replace the invalid namespace prefix present in "message". Is there a way to achieve the same without having a ruby filter before the xml filter?

  3. I would not be able to provide XPath expressions for the nodes to parse, since many API transaction logs are being pushed to Elasticsearch using Logstash, so it is not feasible to list all the tag names. Is there any other workaround?

  4. Would storing the message as XML instead of text make a difference for aggregations in Kibana? As text I am able to index all these XMLs, but because they are huge I am not able to run aggregations on those message fields.

Cheers,
Maadavan

No, the rexml library is used whenever you set the store_xml option to true.

@Badger,

So is there no other option available to store the full XML? Is the only option to strip out the XML tags?

Cheers,
Maadavan

Logstash Team,

How can I avoid getting the error below? I cannot strip the XMLs; I need the entire XML to be in Elasticsearch for aggregations.

RuntimeError: entity expansion has grown too large

Cheers,
Maadavan