XML on Elasticsearch

Hi Guys!

I need to index XML files into an Elasticsearch cluster, using the following flow:

S3 -> Logstash -> Elasticsearch

I read about the xpath option of the xml filter, but my XML is very large, so I cannot map every element by hand.

I need each field in the XML to become a field in Elasticsearch. This is a very simple example of an XML file:

<?xml version="1.0" encoding="ISO-8859-1"?><FAT><DATA><CLIENT Name="bla bla bla" A_C="bla01" Id="001" CP="00981726"></CLIENT></DATA></FAT></xml>

Is the xml filter the best option for this?

That is not valid XML (it opens <CLIENT> and closes </CLIENTE>). Also, you need to strip off the </xml> which can be done using

mutate { gsub => ["message", "</xml>$", ""] }

Then you can parse it using

xml { source => "message" store_xml => true target => "theXML" force_array => false }

which gets you

"theXML" => {
    "DATA" => {
        "CLIENT" => {
             "A_C" => "bla01",
              "CP" => "00981726",
            "Name" => "bla bla bla",
              "Id" => "001"
        }
    }
}
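
Putting the two filters together, a minimal sketch of the filter section could look like the one below. The target name theXML is the one used above. The final conditional that drops the raw message after a successful parse is my own assumption about what you want, so remove it if you need to keep the original XML.

```
filter {
  # Strip the stray closing tag that follows the root element.
  mutate { gsub => ["message", "</xml>$", ""] }

  # Parse the whole document into one nested field; no xpath mapping is needed.
  xml {
    source      => "message"
    store_xml   => true
    target      => "theXML"
    force_array => false
  }

  # Assumption: the raw XML is only worth keeping when parsing failed.
  if "_xmlparsefailure" not in [tags] {
    mutate { remove_field => ["message"] }
  }
}
```

With this in place, events that still carry the _xmlparsefailure tag keep their message field, which makes the broken documents easy to find in Elasticsearch.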

Hi @Badger

I configured mutate and xml, but I received this error:

:exception=>#<REXML::ParseException: missing attribute quote Line: 1 Position: 62576 Last 80 unconsumed characters:

My config:

filter { mutate { gsub => ["message", "</xml>$", ""] } }

filter { xml { source => "message" store_xml => true target => "theXML" force_array => false } }

The index is created in Elasticsearch, but the XML fields are not extracted; everything stays in the message field.

Like this:

"_index" : "teste-2018.08", "_type" : "doc", "_id" : "_1FW9mQBCtVZHh-PRMtz", "_score" : 1.0, "_source" : { "tags" : [ "_xmlparsefailure" ], "@timestamp" : "2018-08-01T16:33:22.758Z", "@version" : "1", "message" : "<?xml version=\\\"1.0\\\" encoding=\\\"ISO-8859-1\\\"?><FAT ...

Immediately after that error message it will show the XML that has an issue. I suspect your XML looks like this

<foo><bar a=1/></foo>

That is not valid XML. It has to be

<foo><bar a="1"/></foo>

You might be able to fix the "XML" using stuff like

mutate { gsub => [ "message", "( a=)([^/> ]+)([/> ])", '\1"\2"\3' ] }
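
The pattern above only touches an attribute literally named a. If several attributes are unquoted, a more general version might look like the sketch below. The attribute-name character class is an assumption on my part and is not taken from your data, so treat it as a starting point rather than a tested fix.

```
filter {
  mutate {
    # Wrap any unquoted attribute value in double quotes.
    # Group 1: a space, the attribute name and "=" (name pattern is an assumption).
    # Group 2: the unquoted value, which must not start with or contain a quote.
    # Group 3: the character that ends the value (space, "/" or ">").
    gsub => [ "message", '( [A-Za-z_][A-Za-z0-9_]*=)([^"/> ]+)([/> ])', '\1"\2"\3' ]
  }
}
```

Values that are already quoted are left alone, because group 2 refuses to match a double quote. It can still misfire on element text or quoted values that happen to contain something like name=value, so try it on a sample document first.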

@Badger

Yes, my XML is valid, look:

<?xml version=\\\"1.0\\\" encoding=\\\"ISO-8859-1\\\"?><FAT><DATA><CLIENT Nome=\\\"bla bla\\\" A_C=\\\"bla01 - .\\\" Id=\\\"0010\\\" CP=\\\"00098281\\\"></CLIENT></DATA></FAT></xml>

This is only part of the XML; there are more than 2,000 lines.

As I said, immediately after the error message is the problematic XML.

[2018-08-01T12:55:15,376][WARN ][logstash.filters.xml     ] Error parsing xml with XmlSimple {:source=>"message", :value=>"<foo><bar a=1></foo>", :exception=>#<REXML::ParseException: missing attribute quote
Line: 1
Position: 20
Last 80 unconsumed characters:
<bar a=1></foo>>, 

Are you able to post the full error message including the unconsumed characters?

@Badger Sure!

:exception=>#<REXML::ParseException: missing attribute quote Line: 1 Position: 102125 Last 80 unconsumed characters: <CLIENT Nome=\"bla bla \" A_C=\"bla01 - .\" Id>, :backtrace=>["uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/parsers/baseparser.rb:374:in `pull_event'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/parsers/baseparser.rb:185:in `pull'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/parsers/treeparser.rb:23:in `parse'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/document.rb:288:in `build'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/document.rb:45:in `initialize'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:971:in `parse'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:164:in `xml_in'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:203:in `xml_in'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/logstash-filter-xml-4.0.5/lib/logstash/filters/xml.rb:182:in `filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:145:in `do_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:164:in `block in multi_filter'", "org/jruby/RubyArray.java:1734:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:161:in `multi_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:47:in `multi_filter'", "(eval):69:in `block in filter_func'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:445:in `filter_batch'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:424:in `worker_loop'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:386:in `block in start_workers'"]}

The log message adds a > to the XML. So the end of the XML it is consuming is

CLIENT Nome=\"bla bla \" A_C=\"bla01 - .\" Id

I think the XML might be truncated. What input are you using?

@Badger

My input is very simple:

input { s3 { "bucket" => "fat" "prefix" => "XML/XX/2018/07/13/1/00" } }

Each XML file is about 350 KB.

I cannot reconcile that error message with the source code unless the input event literally ended at Id.
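
One way to check whether the event really ends at Id is to look at what Logstash actually receives before the xml filter runs. A rough sketch, assuming you can temporarily add a filter and a console output to the pipeline; the field name message_length is just an illustrative name I made up:

```
filter {
  # Record how many characters this event carries, so it can be
  # compared with the size of the file in S3 (about 350 KB).
  ruby { code => 'event.set("message_length", event.get("message").to_s.length)' }
}

output {
  # Print every event in full on the Logstash console for inspection.
  stdout { codec => rubydebug }
}
```

If message_length is far smaller than the file size, or one file shows up as several events, then the input is splitting or truncating the document before the xml filter ever sees it.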