How to quickly parse 1 million XML files

Hello.
I have 1 million XML files, each ranging from a few kilobytes to 20 MB in size.
I need to parse them and index them into Elasticsearch. I tried the following config:

input {
  file {
    path => "/opt/lun1/data-unzip/ftp/228-pur-plan/*.xml"
    exclude => "*.zip"
    type => "228-pur-plan"
    start_position => "beginning"
    max_open_files => "64"
    close_older => "0.1"
    codec => multiline {
      pattern => "xml version"
      negate => true
      what => "previous"
      max_lines => "999999"
      max_bytes => "20 MiB"
    }
  }
}

filter {
  if [type] =~ "228-pur-plan" {
    xml {
      store_xml => false
      source => "message"
      xpath => [
        <about_100_lines>
      ]
    }
  }
}

output {
  if [type] =~ "228-pur-plan" {
    elasticsearch {
      hosts => "127.0.0.1:9200"
      index => "228-pur-plan-%{+YYYY.MM}"
      template => "/opt/logstash/custom_patterns/elasticsearch-template_228.json"
      template_name => "228"
    }
  }
}

My server runs CentOS 6.8 and has:
80 GB of memory
Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
with 16 CPUs.

I set the Logstash heap size to 32 GB.

But it works very slowly! I get a very poor result: 4 files parsed per second.
Please, how can I make it faster? Maybe I need to do some tuning? Or maybe I should use a different input plugin?


Try increasing the pipeline worker count.
Also, is it LS or ES that is the bottleneck?
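
For example, a minimal sketch assuming Logstash 5.x (the values are hypothetical and should be tuned to the hardware), in logstash.yml, or equivalently with the -w and -b command-line flags:

pipeline.workers: 16       # default is one worker thread per CPU core
pipeline.batch.size: 250   # events per worker per batch; larger batches can help throughput at the cost of memory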

How can I check it?

What's the limiting factor? How many files per second can Logstash process alone? How many documents per second can Elasticsearch accept? Measure.
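
One way to measure, as a sketch rather than a prescription: watch the event counters in the monitoring API (e.g. http://localhost:9600/_node/stats), and run a test with the same input and filters but the elasticsearch output swapped for a dots codec on stdout. If the rate stays at a few events per second, Logstash is the bottleneck; if it jumps, look at Elasticsearch.

output {
  # Benchmarking only: prints one dot per event, so events per second
  # can be counted with Elasticsearch completely out of the picture.
  stdout { codec => dots }
}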

I think a major issue is that you're using a single file input which is going to run in a single thread. Splitting it into multiple file inputs is likely to help, but if you really want to improve the performance I suspect writing a program that bulk-reads XML files and feeds them to Logstash via e.g. TCP would perform much better.
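
A rough sketch of the multiple-file-input idea (the glob patterns are hypothetical, and the multiline codec is omitted for brevity; each input would need the same codec block as in your config):

input {
  # Each file input runs in its own thread, so splitting the glob
  # lets several files be read in parallel.
  file {
    path => "/opt/lun1/data-unzip/ftp/228-pur-plan/[0-4]*.xml"
    type => "228-pur-plan"
    start_position => "beginning"
  }
  file {
    path => "/opt/lun1/data-unzip/ftp/228-pur-plan/[5-9]*.xml"
    type => "228-pur-plan"
    start_position => "beginning"
  }
}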


OK. I tried the tcp input too, but I got an error:

[FATAL][logstash.runner ] An unexpected error occurred! {:error=>#<Nokogiri::XML::XPath::SyntaxError: /ns2:purchasePlan/[local-name()='header']/[local-name()='guid']/text()>, :backtrace=>["nokogiri/XmlXpathContext.java:169:in `evaluate'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:165:in `xpath'", "org/jruby/RubyArray.java:2414:in `map'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:156:in `xpath'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:153:in `filter'", "org/jruby/RubyHash.java:1342:in `each'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:152:in `filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:145:in `do_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:164:in `multi_filter'", "org/jruby/RubyArray.java:1613:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:161:in `multi_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:41:in `multi_filter'", "(eval):155:in `initialize'", "org/jruby/RubyArray.java:1613:in `each'", "(eval):152:in `initialize'", "org/jruby/RubyProc.java:281:in `call'", "(eval):127:in `filter_func'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:295:in `filter_batch'", "org/jruby/RubyProc.java:281:in `call'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:192:in `each'", "org/jruby/RubyHash.java:1342:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:191:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:294:in `filter_batch'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:282:in `worker_loop'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:258:in `start_workers'"]}

My tcp config:

input {
  tcp {
    host => "0.0.0.0"
    port => 7101
    type => "228-pur-plan"
    codec => multiline {
      pattern => "xml version"
      negate => true
      what => "previous"
      max_lines => "999999"
      max_bytes => "20 MiB"
    }
  }
}

Always post logs as preformatted text. As it stands the interesting error message has been mangled and isn't visible.

Don't use the multiline codec. Whatever feeds Logstash via TCP should slurp the whole file in one go, removing newline characters if necessary.
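
A minimal feeder sketch along those lines (Python here is just one option; the host, port, and directory are assumptions taken from the config above):

#!/usr/bin/env python
# Reads each XML file whole, strips newlines so the document becomes a
# single line, and sends it to the Logstash tcp input (one file per line).
import glob
import socket

LOGSTASH_HOST = "127.0.0.1"   # assumption: Logstash runs on the same host
LOGSTASH_PORT = 7101          # matches the tcp input above
XML_DIR = "/opt/lun1/data-unzip/ftp/228-pur-plan"

sock = socket.create_connection((LOGSTASH_HOST, LOGSTASH_PORT))
for name in glob.glob(XML_DIR + "/*.xml"):
    with open(name, "rb") as f:
        data = f.read().replace(b"\r", b"").replace(b"\n", b"")
    sock.sendall(data + b"\n")  # the trailing newline terminates the event
sock.close()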

I tried it without the multiline codec:

input {
  tcp {
    host => "0.0.0.0"
    port => "7101"
    type => "228-pur-plan"
  }
}

filter {
  if [type] =~ "228-pur-plan" {
    xml {
      store_xml => false
      source => "message"
      xpath => [
        about 100 lines
      ]
    }
  }
}

output {
  if [type] =~ "228-pur-plan" {
    elasticsearch {
      hosts => "127.0.0.1:9200"
      index => "228-pur-plan-%{+YYYY.MM}"
      template => "/opt/logstash/custom_patterns/elasticsearch-template_228.json"
      template_name => "228"
    }
  }
}

And I got the following error messages:

[2017-01-30T11:41:03,942][ERROR][logstash.pipeline ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>#<Nokogiri::XML::XPath::SyntaxError: /ns2:purchasePlan/*[local-name()='header']/*[local-name()='guid']/text()>, "backtrace"=>["nokogiri/XmlXpathContext.java:169:in evaluate'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:165:in xpath'", "org/jruby/RubyArray.java:2414:in map'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:156:in xpath'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:153:in filter'", "org/jruby/RubyHash.java:1342:in each'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:152:in filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:145:in do_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:164:in multi_filter'", "org/jruby/RubyArray.java:1613:in each'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:161:in multi_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:41:in multi_filter'", "(eval):155:in initialize'", "org/jruby/RubyArray.java:1613:in each'", "(eval):152:in initialize'", "org/jruby/RubyProc.java:281:in call'", "(eval):127:in filter_func'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:295:in filter_batch'", "org/jruby/RubyProc.java:281:in call'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:192:in each'", "org/jruby/RubyHash.java:1342:in each'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:191:in each'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:294:in filter_batch'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:282:in worker_loop'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:258:in start_workers'"]}`

Okay, so the xml filter thinks it gets garbage input. I suggest you remove newline characters from the input so that what you send to Logstash contains one file per line, but that's just a guess at what's wrong. Look at the event the xml filter is complaining about.

I did that. I removed the newline characters and forwarded the files to Logstash through TCP to 3 inputs:

input {
  tcp {
    host => "0.0.0.0"
    port => "7101"
    type => "228-pur-plan"
  }
  tcp {
    host => "0.0.0.0"
    port => "7102"
    type => "228-pur-plan"
  }
  tcp {
    host => "0.0.0.0"
    port => "7103"
    type => "228-pur-plan"
  }
}

But I got a bad result: 4-10 XML files per second.
I tried output to a file (instead of elasticsearch) but I got 4-10 XML files per second too.

Where is my bottleneck? I think it is Logstash, but where exactly?

I tried removing the xpath XML parsing, and I got a good result: over 100 XML files per second.
So parsing is the bottleneck. How can I optimize the xpath parsing or increase the amount of resources allocated to parsing?

The XML support in Logstash is provided by an external library, which is fine for small use cases but very slow for xpath and other complex use cases.

You could consider an XML/XSLT to JSON preprocessor before Logstash. Example: https://github.com/bramstein/xsltjson
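
The linked project is XSLT-based. As an alternative sketch of the same idea (assuming the third-party xmltodict package and hypothetical paths), a small Python preprocessor could turn each XML file into one JSON line, which Logstash can then read with the json_lines codec instead of the xml/xpath filter:

#!/usr/bin/env python
# Converts each XML file into a single JSON line for Logstash to ingest.
import glob
import json

import xmltodict  # third-party package: pip install xmltodict

XML_DIR = "/opt/lun1/data-unzip/ftp/228-pur-plan"      # assumption
OUT_FILE = "/opt/lun1/data-unzip/228-pur-plan.jsonl"   # assumption

with open(OUT_FILE, "w") as out:
    for name in glob.glob(XML_DIR + "/*.xml"):
        with open(name, "rb") as f:
            doc = xmltodict.parse(f.read())
        out.write(json.dumps(doc) + "\n")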


OK. I converted my XML to JSON and tested Logstash parsing. My config:

input{
        unix {
                path => "/tmp/socket1"
                mode => "server"
                type => "228-pur-plan-soc"
                codec => "json_lines"
        }
        unix {
                path => "/tmp/socket2"
                mode => "server"
                type => "228-pur-plan-soc"
                codec => "json_lines"
        }
        unix {
                path => "/tmp/socket3"
                mode => "server"
                type => "228-pur-plan-soc"
                codec => "json_lines"
        }
        tcp {
                host => "0.0.0.0"
                port => "7101"
                type => "228-pur-plan-tcp"
                codec => "json_lines"
        }
        tcp {
                host => "0.0.0.0"
                port => "7102"
                type => "228-pur-plan-tcp"
                codec => "json_lines"
        }
        tcp {
                host => "0.0.0.0"
                port => "7103"
                type => "228-pur-plan-tcp"
                codec => "json_lines"
        }
}

filter {
   if [type] =~ "228-pur-plan-soc" {
        json { source => "message"  }
        metrics {
                meter => "soc_events"
                add_tag => "metrics"
        }
   }

   if [type] =~ "228-pur-plan-tcp" {
        json { source => "message"  }
        metrics {
                meter => "tcp_events"
                add_tag => "metrics"
        }
   }

}

output {
         if [type] =~ "228-pur-plan" {
                file {
                    path => "/opt/lun1/out.dat"
                }
       }
        if "metrics" in [tags] {
                elasticsearch {
                        hosts => "127.0.0.1:9200"
                        index => "ls_metrics-%{+YYYY.MM}"
                }
        }
}

And again I got bad results. How can I get more speed?

Have you tried setting up Beats?
It is designed precisely for reading files.

English please.

Have you tried setting up Beats (Filebeat)?
It is designed specifically for reading files and sending them to ES.
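
A minimal Filebeat sketch of that idea (assuming Filebeat 5.x; the paths and multiline pattern mirror the Logstash config above, and the Logstash port is an assumption):

filebeat.prospectors:
  - input_type: log
    paths:
      - /opt/lun1/data-unzip/ftp/228-pur-plan/*.xml
    # Join all lines of one XML document into a single event,
    # the same idea as the multiline codec used earlier.
    multiline.pattern: 'xml version'
    multiline.negate: true
    multiline.match: after
    multiline.max_lines: 999999

output.logstash:
  hosts: ["127.0.0.1:5044"]   # assumes a beats input in Logstash; Filebeat can also ship straight to ES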

Do you mean that my problem is with the input?

What speed do you expect, and what is the speed of your hard drive?
Is ES on the same server where the files are read?

At up to 20 MB per file, 4-10 XML files per second works out to roughly 80-200 MB/s, which is about the sequential speed of an average HDD.
Perhaps you are using an SSD?

I expect more than 50 JSON (or XML) documents per second.

I tried the test on another server (Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz with 40 CPUs), and I got the same result: 4-10 JSON documents per second.

If you want to test whether ES is your bottleneck you could try the redis-output talking to Redis. This will use bulk updates and networking and therefore approximates the ES output and downstream.
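
For example, a sketch with hypothetical Redis host and key:

output {
  # Send events to a Redis list instead of Elasticsearch to see whether
  # throughput improves once ES is out of the picture.
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash-228-pur-plan"
  }
}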

OK, I tested it. You can view my results in the topic "Failed to send event to Redis". :slight_smile: