XML filter issues

I am using the xml filter to parse XML documents read from S3 files. It works for some of the documents, but many others fail. Elasticsearch rejects a lot of them and logs this error:

[2018-03-17T19:16:06,918][DEBUG][o.e.a.b.TransportShardBulkAction] [us-patent-grants-2018.03.17][1] failed to execute bulk item (index) BulkShardRequest [[us-patent-grants-2018.03.17][1]] containing [index {[us-patent-grants-2018.03.17][doc][L7VjNWIBNxeDr0PE-AuG], source[n/a, actual length: [168.6kb], max length: 2kb]}]

and in the Logstash logs I get:

Line: 2
Position: 39
Last 80 unconsumed characters:
>, :backtrace=>["uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/parsers/baseparser.rb:341:in `pull_event'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/parsers/baseparser.rb:185:in `pull'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/parsers/treeparser.rb:23:in `parse'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/document.rb:288:in `build'", "uri:classloader:/META-INF/jruby.home/lib/ruby/stdlib/rexml/document.rb:45:in `initialize'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:971:in `parse'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:164:in `xml_in'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/xml-simple-1.1.5/lib/xmlsimple.rb:203:in `xml_in'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/logstash-filter-xml-4.0.5/lib/logstash/filters/xml.rb:187:in `filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:145:in `do_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:164:in `block in multi_filter'", "org/jruby/RubyArray.java:1734:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:161:in `multi_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:47:in `multi_filter'", "(eval):42:in `block in filter_func'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:447:in `filter_batch'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:426:in `worker_loop'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:385:in `block in start_workers'"]}

I'm not really sure what this means or why it's happening. Any ideas?

I am using the multiline input codec:

codec => multiline {
  pattern => "^*"
  what => "next"
  max_lines => 10000
  max_bytes => "100 MiB"
}

and the xml filter:

xml {
  source => "message"
  store_xml => true
  target => "message"
}

I think I found the issue, but I'm not sure. The source XML doc contains this structure:

<claim id="CLM-00023" num="00023">
<claim-text>23. The method for assessing technical skills of a student undergoing evaluation in a practical training environment according to <claim-ref idref="CLM-00017">claim 17</claim-ref>, wherein:
<claim-text>said given student is assigned a unique IP address relevant to a given computer said student is cyber training on, to enable said physical scoring server to associate said attributed command information with said IP address.</claim-text>
</claim-text>
</claim>

So basically, the structure is <claim-text> lkdlkjsdfsdf <claim-text> lkdlkjsdfsdf </claim-text></claim-text>: the outer <claim-text> holds text data on the same level as a nested tag (mixed content).

So I'm not sure how this would work.

How does the XML filter handle this? Since the parent <claim-text> tag contains another <claim-text> tag, I would think it would parse it as an object, but in that case, what would the key be for the text on the same level as the inner <claim-text> tag?
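
To make the question concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. This is not the Ruby xml-simple parser that the Logstash xml filter uses, so the filter's output may well differ; it only illustrates how one generic parser represents text that sits next to a nested tag. The SNIPPET string is a shortened version of the claim above (the wording is abbreviated and the line breaks are dropped), and the variable names are my own.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the <claim> fragment above; text abbreviated for readability.
SNIPPET = (
    '<claim id="CLM-00023" num="00023">'
    '<claim-text>23. The method according to '
    '<claim-ref idref="CLM-00017">claim 17</claim-ref>, wherein:'
    '<claim-text>said given student is assigned a unique IP address.</claim-text>'
    '</claim-text>'
    '</claim>'
)

# fromstring raises xml.etree.ElementTree.ParseError if the input is not well-formed.
claim = ET.fromstring(SNIPPET)
outer = claim.find("claim-text")

# Text before the first child element is stored on the parent itself.
leading_text = outer.text

# Text that follows a child element is stored on that child as its "tail".
children = [(child.tag, child.text, child.tail) for child in outer]

# itertext() walks the subtree and yields every text piece in document order.
flattened = "".join(outer.itertext())

print(repr(leading_text))
for tag, text, tail in children:
    print(tag, repr(text), repr(tail))
print(flattened)
```

With this parser the fragment parses without error, and the text on the same level as the inner tags is not lost: it ends up in the parent's text and in each child's tail. What I still don't know is which key, if any, the xml filter uses for that text when store_xml => true.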
