Logstash does not process large XML with multiple xpath queries

Hello people!

I have the following problem:

1 - I have an XML file of roughly 69 MB, and my configuration is as follows:

input {
	file {
		path => "/tmp/ACOR_2019.XML"
		start_position => "beginning"
		sincedb_path => "NUL"
		max_open_files => 65000
		codec => multiline {
			charset => "ISO-8859-1"
			pattern => "^<ACOR.*"
			negate => "true"
			what => "previous"
			auto_flush_interval => 2
		}
	}
}
filter {
	xml {
			source => "message"
			force_array => "false"
			store_xml => "false"
			xpath => [
			    "/ACOR/DOCN/text()", "DOCN",
			    "/ACOR/TIPO/text()", "TIPO",
			    "/ACOR/REG/text()", "REG",
			    "/ACOR/CLAS/text()", "CLAS",
			    "/ACOR/DCLA/text()", "DCLA",
			    "/ACOR/NUM/text()","NUM",
				"/ACOR/UF/text()","UF",
				"/ACOR/DECI/text()","DECI",
				"/ACOR/DTDE/text()","DTDE",
				"/ACOR/CORG/text()", "CORG",
			    "/ACOR/ORG/text()", "ORG",
			    "/ACOR/EMEN_S/EMEN/text()", "EMEM",
			    "/ACOR/RELA/text()", "RELA",
			    "/ACOR/NREL/text()", "NREL",
			    "/ACOR/REVI/text()","REVI",
				"/ACOR/NREV/text()","NREV",
				"/ACOR/RACO/text()","RACO",
				"/ACOR/NRAC/text()","NRAC",
			    "/ACOR/INDE_S/INDE/text()","INDE",
				"/ACOR/STIP_S/STIP/text()","STIP",
				"/ACOR/OPIX/text()","OPTIX",
				"/ACOR/DTIX/text()","DTIX",
			    "/ACOR/SUCE_S/SUCE/text()","SUCE",
				"/ACOR/TSUC/text()","TSUC",
				"/ACOR/FONT_S/FONT/text()","FONT",
				"/ACOR/DTPB/text()","DTPB",
			    "/ACOR/VEJA/text()","VEJA",
				"/ACOR/DOUT_S/DOUT/text()","DOUT",
				"/ACOR/REF_S/REF/text()","REF",
				"/ACOR/DTIN/text()","DTIN",
				"/ACOR/OPIN/text()","OPIN",
				"/ACOR/DTAL/text()","DTAL",
				"/ACOR/OPAL/text()","OPAL",
				"/ACOR/CDOC/text()","CDOC",
				"/ACOR/NOTA/text()","NOTA",
				"/ACOR/DTRV/text()","DTRV",
				"/ACOR/AREV/text()","AREV",
				"/ACOR/LINK/text()","LINK",
				"/ACOR/COMP/text()","COMP",
				"/ACOR/LREF/text()","LREF",
				"/ACOR/CLAP/text()","CLAP",
				"/ACOR/DCOM/text()","DCON",
				"/ACOR/IDTQ/text()","IDTQ",
				"/ACOR/NDTQ/text()","NDTQ",
				"/ACOR/MDTQ/text()","MDTQ",
				"/ACOR/RDTQ/text()","RDTQ",
				"/ACOR/INDX/text()","INDX",
				"/ACOR/NOTF/text()","NOTF",
				"/ACOR/DOUF/text()","DOUF",
				"/ACOR/TPOB/text()","TPOB",
				"/ACOR/OPCL/text()","OPCL",
				"/ACOR/DTCL/text()","DTCL",
				"/ACOR/DISP/text()","DISP",
				"/ACOR/DTCF/text()","DTCF",
				"/ACOR/ACON/text()","ACON",
				"/ACOR/IDL/text()","IDL",
				"/ACOR/INDW_S/INDW/text()","INDW",
				"/ACOR/PART_S/PART/text()","PART",
				"/ACOR/FNRE/text()","FNRE",
				"/ACOR/TSJU/text()","TSJU",
				"/ACOR/CSCO/text()","CSCO",
				"/ACOR/DTMN/text()","DTMN",
				"/ACOR/AMON/text()","AMON"
        ]
      remove_field => ["message"]
    }
}
output {
	elasticsearch {
		hosts => ["http://localhost:9200"]
		index => "ac-%{+yyyy}"
		user => "elastic"
		password => "XXXXX"
	}
#	stdout {}
}

2 - When I run Logstash with debug enabled, I see that at some point ingestion and filtering stop running, yet no messages are displayed in the Logstash logs. Looking at the logs, I can see that the pipeline has stopped collecting and filtering, as shown in the accompanying figures.
This is the filter process (note the stop time):

[screenshot omitted]

This is the input process:

[screenshot omitted]
Analyzing my server, even though the logs show no more events, I can see that all of the CPU is being consumed by Logstash, as shown below:

As mentioned earlier, the Logstash logs show no errors or anomalies.

3 - When I remove the xml filter, the messages are processed correctly, totaling 10604 events; with the filter enabled, it only processes between 1300 and 1400 events.

4 - I have a feeling that the xml filter can't handle large XML files.

5 - My environment has the following configuration:
CPU: 8 cores
Heap: 6 GB
Disk: M.2 SSD
Logstash: 7.3.1

6 - This is an example of my XML:
https://raw.githubusercontent.com/tornis/elastic/master/example.xml

Has anyone had this problem?

Thank you

If that Logstash instance is using more than half of the server's CPU, then the hot threads API might help.
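On a default install the monitoring API listens on port 9600 (an assumption; check "http.port" in logstash.yml if it was changed), so something like this should return the busiest threads:

```
curl -s 'http://localhost:9600/_node/hot_threads?human=true&threads=10'
```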

Hi Badger!

Here is the result of my Logstash hot threads:

What I notice is that the class:
org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)

shows up in the hot threads output with an "Unknown Source" value.

Look:

::: {}
Hot threads at 2019-09-03T19:32:58-03:00, busiestThreads=10: 
================================================================================
72.19 % of cpu usage, state: runnable, thread name: '[main]>worker7', thread id: 32 
	org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
	org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
	org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
	org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
	org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
	org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
	org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
	org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
	nokogiri.internals.NokogiriDomParser.parse(NokogiriDomParser.java:94)
	nokogiri.internals.XmlDomParserContext.do_parse(XmlDomParserContext.java:262)
	nokogiri.internals.XmlDomParserContext.parse(XmlDomParserContext.java:248)
	nokogiri.XmlDocument.newFromData(XmlDocument.java:329)
	nokogiri.XmlDocument.read_memory(XmlDocument.java:343)
	java.lang.invoke.LambdaForm$DMH/172032696.invokeStatic_L3_L(LambdaForm$DMH)
	java.lang.invoke.LambdaForm$BMH/1155300221.reinvoke(LambdaForm$BMH)
	java.lang.invoke.LambdaForm$MH/6320204.delegate(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/1102463008.guard(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/6320204.delegate(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/1102463008.guard(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/111238003.linkToCallSite(LambdaForm$MH)
	home.tornis.temp.logstash_minus_7_dot_3_dot_1.vendor.bundle.jruby.$2_dot_5_dot_0.gems.nokogiri_minus_1_dot_10_dot_4_minus_java.lib.nokogiri.xml.document.RUBY$method$parse$0(/home/tornis/temp/logstash-7.3.1/vendor/bundle/jruby/2.5.0/gems/nokogiri-1.10.4-java/lib/nokogiri/xml/document.rb:62)
	java.lang.invoke.LambdaForm$DMH/2100961961.invokeStatic_L7_L(LambdaForm$DMH)
	java.lang.invoke.LambdaForm$BMH/2084706642.reinvoke(LambdaForm$BMH)
	java.lang.invoke.LambdaForm$MH/1989335500.delegate(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/2059425952.guard(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/1989335500.delegate(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/2059425952.guard(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/663824117.linkToCallSite(LambdaForm$MH)
	home.tornis.temp.logstash_minus_7_dot_3_dot_1.vendor.bundle.jruby.$2_dot_5_dot_0.gems.nokogiri_minus_1_dot_10_dot_4_minus_java.lib.nokogiri.xml.RUBY$method$XML$0(/home/tornis/temp/logstash-7.3.1/vendor/bundle/jruby/2.5.0/gems/nokogiri-1.10.4-java/lib/nokogiri/xml.rb:35)
	java.lang.invoke.LambdaForm$DMH/2100961961.invokeStatic_L7_L(LambdaForm$DMH)
	java.lang.invoke.LambdaForm$BMH/1805402813.reinvoke(LambdaForm$BMH)
	java.lang.invoke.LambdaForm$MH/510854293.delegate(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/202395890.guard(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/510854293.delegate(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/202395890.guard(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/2146285461.linkToCallSite(LambdaForm$MH)
	home.tornis.temp.logstash_minus_7_dot_3_dot_1.vendor.bundle.jruby.$2_dot_5_dot_0.gems.logstash_minus_filter_minus_xml_minus_4_dot_0_dot_7.lib.logstash.filters.xml.RUBY$method$filter$0(/home/tornis/temp/logstash-7.3.1/vendor/bundle/jruby/2.5.0/gems/logstash-filter-xml-4.0.7/lib/logstash/filters/xml.rb:144)
	java.lang.invoke.LambdaForm$DMH/2100961961.invokeStatic_L7_L(LambdaForm$DMH)
	java.lang.invoke.LambdaForm$BMH/1802555899.reinvoke(LambdaForm$BMH)
	java.lang.invoke.LambdaForm$MH/597190999.delegate(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/1043822951.guard(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/597190999.delegate(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/1043822951.guard(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/852760745.linkToCallSite(LambdaForm$MH)
	home.tornis.temp.logstash_minus_7_dot_3_dot_1.logstash_minus_core.lib.logstash.filters.base.RUBY$method$do_filter$0(/home/tornis/temp/logstash-7.3.1/logstash-core/lib/logstash/filters/base.rb:143)
	java.lang.invoke.LambdaForm$DMH/2100961961.invokeStatic_L7_L(LambdaForm$DMH)
	java.lang.invoke.LambdaForm$BMH/1802555899.reinvoke(LambdaForm$BMH)
	java.lang.invoke.LambdaForm$MH/1234757451.guardWithCatch(LambdaForm$MH)
	java.lang.invoke.LambdaForm$MH/944757319.identity_L(LambdaForm$MH)

I would try to get a smaller reproducible test case. Add "--pipeline.batch.size 1" to the command line, run again, and see what the last entry logged by the multiline codec is (2334 in the above output). Remove everything after that from the input file and see if the problem is reproducible. Similarly, remove the first 2333 entries and see if it is reproducible. Then simplify the xml filter. Does the problem occur if you have just one xpath entry? Does it happen if you keep half of the xpath entries?
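When simplifying the filter, one variant worth testing (a sketch of an alternative, not a confirmed fix; the "acor" target name is arbitrary) is to store the parsed document once instead of evaluating roughly fifty separate xpath queries against it:

```
filter {
	xml {
		source => "message"
		store_xml => true
		target => "acor"
		force_array => false
		remove_field => ["message"]
	}
}
```

The field layout changes (e.g. [acor][DOCN] instead of DOCN), so index mappings would need adjusting, but it isolates whether the repeated xpath evaluation is the bottleneck.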
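The bisection described above can be scripted. This is a minimal sketch (the helper names are mine, not from the thread), assuming records are delimited exactly like the multiline codec's "^<ACOR.*" pattern, i.e. a new record starts on any line beginning with "<ACOR":

```python
import re

# Mirrors the multiline codec boundary from the thread's config:
# a new record starts on any line beginning with "<ACOR".
RECORD_START = re.compile(r"^<ACOR")

def split_acor_records(text):
    """Split raw XML text into records at lines that open a new <ACOR...> element."""
    records, current = [], []
    for line in text.splitlines(keepends=True):
        if RECORD_START.match(line) and current:
            records.append("".join(current))
            current = []
        current.append(line)
    if current:
        records.append("".join(current))
    return records

def keep_first(text, n):
    """Keep only the first n records -- useful for bisecting a failing input file."""
    return "".join(split_acor_records(text)[:n])
```

Reading /tmp/ACOR_2019.XML with the ISO-8859-1 encoding from the config, writing out keep_first(..., n) for various n, and re-running Logstash on each truncated file should narrow down whether one specific record triggers the stall.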

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.