XML on Elasticsearch

Hi Guys!

I need to index XML files into an Elasticsearch cluster, using the following flow:

S3 -> Logstash -> Elasticsearch

I read about xpath in the xml filter, but my XML is very large, so I can't map every element by hand.

I need each field in the XML to become a field in Elasticsearch. Here is a very simple example of the XML file:

<?xml version="1.0" encoding="ISO-8859-1"?><FAT><DATA><CLIENT Name="bla bla bla" A_C="bla01" Id="001" CP="00981726"></CLIENT></DATA></FAT></xml>

Is the xml filter the best option for this?

That is not valid XML (it opens <CLIENT> and closes </CLIENTE>). Also, you need to strip off the </xml> which can be done using

mutate { gsub => ["message", "</xml>$", ""] }

Then you can parse it using

xml { source => "message" store_xml => true target => "theXML" force_array => false }

which gets you

"theXML" => {
    "DATA" => {
        "CLIENT" => {
             "A_C" => "bla01",
              "CP" => "00981726",
            "Name" => "bla bla bla",
              "Id" => "001"
        }
    }
}
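
Put together, the two steps above might look like the following single filter block. This is a minimal sketch rather than a tested configuration: it assumes the raw document arrives in the message field, and the commented-out remove_field line is an optional addition that is not part of the original answer.

```
filter {
  # Strip the stray closing </xml> at the end of the document;
  # it has no matching opening tag, so the parser would reject it.
  mutate {
    gsub => ["message", "</xml>$", ""]
  }

  # Parse the cleaned document into a nested structure under [theXML].
  xml {
    source => "message"
    store_xml => true
    target => "theXML"
    # Keep single elements as plain values instead of one-element arrays.
    force_array => false
    # Optional (assumption, not in the original answer):
    # drop the raw XML once the filter has succeeded.
    # remove_field => ["message"]
  }
}
```

Filters run in the order they are written, so the mutate has to come before the xml filter for the cleanup to take effect.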

Hi @Badger

I configured mutate and xml, but I received this error:

:exception=>#<REXML::ParseException: missing attribute quote Line: 1 Position: 62576 Last 80 unconsumed characters:

My config:

filter { mutate { gsub => ["message", "</xml>$", ""] } }

filter { xml { source => "message" store_xml => true target => "theXML" force_array => false } }

The index is created in Elasticsearch, but the documents contain only the raw message field and none of the parsed XML fields.

Like this:

"_index" : "teste-2018.08", "_type" : "doc", "_id" : "_1FW9mQBCtVZHh-PRMtz", "_score" : 1.0, "_source" : { "tags" : [ "_xmlparsefailure" ], "@timestamp" : "2018-08-01T16:33:22.758Z", "@version" : "1", "message" : "<?xml version=\\\"1.0\\\" encoding=\\\"ISO-8859-1\\\"?><FAT ...

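When the xml filter cannot parse an event, it adds the _xmlparsefailure tag to it. One possible way to keep such events out of the index while debugging is to branch on that tag in the output section. The sketch below is only an illustration: the host localhost:9200 is hypothetical, and the index pattern is a guess based on an index named teste-2018.08.

```
output {
  if "_xmlparsefailure" in [tags] {
    # Print failed events to the console for inspection instead of indexing them.
    stdout { codec => rubydebug }
  } else {
    elasticsearch {
      # Hypothetical host; replace with the real cluster address.
      hosts => ["localhost:9200"]
      # Assumed monthly index pattern.
      index => "teste-%{+YYYY.MM}"
    }
  }
}
```
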
Immediately after that error message it will show the XML that has an issue. I suspect your XML looks like this

<foo><bar a=1/></foo>

That is not valid XML. It has to be

<foo><bar a="1"/></foo>

You might be able to fix the "XML" using stuff like

mutate { gsub => [ "message", "( a=)([^/> ]+)([/> ])", '\1"\2"\3' ] }

@Badger

Yes, my XML is valid, look:

<?xml version=\\\"1.0\\\" encoding=\\\"ISO-8859-1\\\"?><FAT><DATA><CLIENT Nome=\\\"bla bla\\\" A_C=\\\"bla01 - .\\\" Id=\\\"0010\\\" CP=\\\"00098281\\\"></CLIENT></DATA></FAT></xml>

This is only part of the XML; there are more than 2,000 lines.

As I said, immediately after the error message is the problematic XML.

[2018-08-01T12:55:15,376][WARN ][logstash.filters.xml     ] Error parsing xml with XmlSimple {:source=>"message", :value=>"<foo><bar a=1></foo>", :exception=>#<REXML::ParseException: missing attribute quote
Line: 1
Position: 20
Last 80 unconsumed characters:
<bar a=1></foo>>, 

Are you able to post the full error message including the unconsumed characters?

@Badger Sure!

:exception=>#<REXML::ParseException: missing attribute quote Line: 1 Position: 102125 Last 80 unconsumed characters: <CLIENT Nome=\"bla bla \" A_C=\"bla01 - .\" Id>, :backtrace=>["uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/parsers/baseparser.rb:374:in `pull_event'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/parsers/baseparser.rb:185:in `pull'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/parsers/treeparser.rb:23:in `parse'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/document.rb:288:in `build'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/document.rb:45:in `initialize'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:971:in `parse'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:164:in `xml_in'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:203:in `xml_in'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/logstash-filter-xml-4.0.5/lib/logstash/filters/xml.rb:182:in `filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:145:in `do_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:164:in `block in multi_filter'", "org/jruby/RubyArray.java:1734:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:161:in `multi_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:47:in `multi_filter'", "(eval):69:in `block in filter_func'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:445:in `filter_batch'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:424:in `worker_loop'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:386:in `block in start_workers'"]}

The log message adds a > to the XML. So the end of the XML it is consuming is

CLIENT Nome=\"bla bla \" A_C=\"bla01 - .\" Id

I think the XML might be truncated. What input are you using?

@Badger

My input is very simple:

input { s3 { "bucket" => "fat" "prefix" => "XML/XX/2018/07/13/1/00" } }

Each XML file is about 350 KB.

I cannot reconcile that error message with the source code unless the input event literally ended at Id.
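
One way to test the truncation hypothesis is to record how much of the document actually reaches the filter stage. The following is a sketch only: the field name message_length and the tag xml_possibly_truncated are made up for illustration, and it assumes a complete document contains the closing </FAT> tag from the sample earlier in the thread.

```
filter {
  # Store the length of the raw event so it can be compared
  # with the size of the source file in S3.
  ruby {
    code => 'event.set("message_length", event.get("message").to_s.length)'
  }

  # Flag events that do not contain the closing root tag.
  if [message] !~ /<\/FAT>/ {
    mutate { add_tag => ["xml_possibly_truncated"] }
  }
}
```

The length is counted in characters rather than bytes, so treat the comparison with the file size as approximate. If message_length comes out far below the roughly 350 KB file size, the event was probably cut short before it reached the xml filter.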
