How to quickly parse 1 million XML files

Hello.
I have 1 million XML files, each ranging from a few kilobytes to 20 MB in size.
I need to parse them and index them into Elasticsearch. I tried the following config:

input {
  file {
    path => "/opt/lun1/data-unzip/ftp/228-pur-plan/*.xml"
    exclude => "*.zip"
    type => "228-pur-plan"
    start_position => "beginning"
    max_open_files => "64"
    close_older => "0.1"
    codec => multiline {
      pattern => "xml version"
      negate => true
      what => "previous"
      max_lines => "999999"
      max_bytes => "20 MiB"
    }
  }
}

filter {
  if [type] =~ "228-pur-plan" {
    xml {
      store_xml => false
      source => "message"
      xpath => [
        <about_100_lines>
      ]
    }
  }
}

output {
  if [type] =~ "228-pur-plan" {
    elasticsearch {
      hosts => "127.0.0.1:9200"
      index => "228-pur-plan-%{+YYYY.MM}"
      template => "/opt/logstash/custom_patterns/elasticsearch-template_228.json"
      template_name => "228"
    }
  }
}

My server runs CentOS 6.8 and has:
80 GB of memory
Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
with 16 CPUs.

I set the Logstash heap size to 32 GB.

But it works very slowly! I get a very poor result: 4 files parsed per second.
Please, how can I make it faster? Maybe I need to do some tuning? Or maybe I should use a different input plugin?


Try increasing the pipeline worker count.
Also, is it LS or ES that is the bottleneck?
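
For example, a minimal sketch assuming Logstash 5.x (the values are hypothetical and should be tuned to the hardware), in logstash.yml, or equivalently with the -w and -b command-line flags:

pipeline.workers: 16       # default is one worker thread per CPU core
pipeline.batch.size: 250   # events per worker per batch; larger batches can help throughput at the cost of memory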

How can I check it?

What's the limiting factor? How many files per second can Logstash process alone? How many documents per second can Elasticsearch accept? Measure.
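
One way to measure, as a sketch rather than a prescription: watch the event counters in the monitoring API (e.g. http://localhost:9600/_node/stats), and run a test with the same input and filters but the elasticsearch output swapped for a dots codec on stdout. If the rate stays at a few events per second, Logstash is the bottleneck; if it jumps, look at Elasticsearch.

output {
  # Benchmarking only: prints one dot per event, so events per second
  # can be counted with Elasticsearch completely out of the picture.
  stdout { codec => dots }
}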

I think a major issue is that you're using a single file input which is going to run in a single thread. Splitting it into multiple file inputs is likely to help, but if you really want to improve the performance I suspect writing a program that bulk-reads XML files and feeds them to Logstash via e.g. TCP would perform much better.
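
A rough sketch of the multiple-file-input idea (the glob patterns are hypothetical, and the multiline codec is omitted for brevity; each input would need the same codec block as in your config):

input {
  # Each file input runs in its own thread, so splitting the glob
  # lets several files be read in parallel.
  file {
    path => "/opt/lun1/data-unzip/ftp/228-pur-plan/[0-4]*.xml"
    type => "228-pur-plan"
    start_position => "beginning"
  }
  file {
    path => "/opt/lun1/data-unzip/ftp/228-pur-plan/[5-9]*.xml"
    type => "228-pur-plan"
    start_position => "beginning"
  }
}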


OK. I tried the tcp input too, but I got an error:

[FATAL][logstash.runner ] An unexpected error occurred! {:error=>#<Nokogiri::XML::XPath::SyntaxError: /ns2:purchasePlan/[local-name()='header']/[local-name()='guid']/text()>, :backtrace=>["nokogiri/XmlXpathContext.java:169:in `evaluate'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:165:in `xpath'", "org/jruby/RubyArray.java:2414:in `map'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:156:in `xpath'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:153:in `filter'", "org/jruby/RubyHash.java:1342:in `each'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:152:in `filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:145:in `do_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:164:in `multi_filter'", "org/jruby/RubyArray.java:1613:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:161:in `multi_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:41:in `multi_filter'", "(eval):155:in `initialize'", "org/jruby/RubyArray.java:1613:in `each'", "(eval):152:in `initialize'", "org/jruby/RubyProc.java:281:in `call'", "(eval):127:in `filter_func'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:295:in `filter_batch'", "org/jruby/RubyProc.java:281:in `call'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:192:in `each'", "org/jruby/RubyHash.java:1342:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:191:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:294:in `filter_batch'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:282:in `worker_loop'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:258:in `start_workers'"]}

My tcp config:

input {
  tcp {
    host => "0.0.0.0"
    port => 7101
    type => "228-pur-plan"
    codec => multiline {
      pattern => "xml version"
      negate => true
      what => "previous"
      max_lines => "999999"
      max_bytes => "20 MiB"
    }
  }
}

Always post logs as preformatted text. As it stands the interesting error message has been mangled and isn't visible.

Don't use the multiline codec. Whatever feeds Logstash via TCP should slurp the whole file in one go, removing newline characters if necessary.
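
A minimal feeder sketch along those lines (Python here is just one option; the host, port, and directory are assumptions taken from the config above):

#!/usr/bin/env python
# Reads each XML file whole, strips newlines so the document becomes a
# single line, and sends it to the Logstash tcp input (one file per line).
import glob
import socket

LOGSTASH_HOST = "127.0.0.1"   # assumption: Logstash runs on the same host
LOGSTASH_PORT = 7101          # matches the tcp input above
XML_DIR = "/opt/lun1/data-unzip/ftp/228-pur-plan"

sock = socket.create_connection((LOGSTASH_HOST, LOGSTASH_PORT))
for name in glob.glob(XML_DIR + "/*.xml"):
    with open(name, "rb") as f:
        data = f.read().replace(b"\r", b"").replace(b"\n", b"")
    sock.sendall(data + b"\n")  # the trailing newline terminates the event
sock.close()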

I tried it without the multiline codec:

input {
  tcp {
    host => "0.0.0.0"
    port => "7101"
    type => "228-pur-plan"
  }
}

filter {
  if [type] =~ "228-pur-plan" {
    xml {
      store_xml => false
      source => "message"
      xpath => [
        about 100 lines
      ]
    }
  }
}

output {
  if [type] =~ "228-pur-plan" {
    elasticsearch {
      hosts => "127.0.0.1:9200"
      index => "228-pur-plan-%{+YYYY.MM}"
      template => "/opt/logstash/custom_patterns/elasticsearch-template_228.json"
      template_name => "228"
    }
  }
}

And I got the following error messages:

[2017-01-30T11:41:03,942][ERROR][logstash.pipeline ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>#<Nokogiri::XML::XPath::SyntaxError: /ns2:purchasePlan/*[local-name()='header']/*[local-name()='guid']/text()>, "backtrace"=>["nokogiri/XmlXpathContext.java:169:in evaluate'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:165:in xpath'", "org/jruby/RubyArray.java:2414:in map'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:156:in xpath'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:153:in filter'", "org/jruby/RubyHash.java:1342:in each'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:152:in filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:145:in do_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:164:in multi_filter'", "org/jruby/RubyArray.java:1613:in each'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:161:in multi_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:41:in multi_filter'", "(eval):155:in initialize'", "org/jruby/RubyArray.java:1613:in each'", "(eval):152:in initialize'", "org/jruby/RubyProc.java:281:in call'", "(eval):127:in filter_func'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:295:in filter_batch'", "org/jruby/RubyProc.java:281:in call'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:192:in each'", "org/jruby/RubyHash.java:1342:in each'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:191:in each'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:294:in filter_batch'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:282:in worker_loop'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:258:in start_workers'"]}`

Okay, so the xml filter thinks it gets garbage input. I suggest you remove newline characters from the input so that what you send to Logstash contains one file per line, but that's just a guess at what's wrong. Look at the event the xml filter is complaining about.

I did that. I removed the newline characters and forwarded the files to Logstash through TCP to 3 inputs:

input {
  tcp {
    host => "0.0.0.0"
    port => "7101"
    type => "228-pur-plan"
  }
  tcp {
    host => "0.0.0.0"
    port => "7102"
    type => "228-pur-plan"
  }
  tcp {
    host => "0.0.0.0"
    port => "7103"
    type => "228-pur-plan"
  }
}

But I got a bad result: 4-10 XML files per second.
I tried output to a file (instead of elasticsearch) but I got 4-10 XML files per second too.

Where is my bottleneck? I think it is Logstash, but where exactly?

I tried removing the xpath XML parsing, and I got a good result: over 100 XML files per second.
So parsing is the bottleneck. How can I optimize the xpath parsing or increase the amount of resources allocated to parsing?

The XML support in Logstash is provided by an external library, which is fine for small use cases but very slow for xpath and other complex use cases.

You could consider an XML/XSLT to JSON preprocessor before Logstash. Example: https://github.com/bramstein/xsltjson
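
The linked project is XSLT-based. As an alternative sketch of the same idea (assuming the third-party xmltodict package and hypothetical paths), a small Python preprocessor could turn each XML file into one JSON line, which Logstash can then read with the json_lines codec instead of the xml/xpath filter:

#!/usr/bin/env python
# Converts each XML file into a single JSON line for Logstash to ingest.
import glob
import json

import xmltodict  # third-party package: pip install xmltodict

XML_DIR = "/opt/lun1/data-unzip/ftp/228-pur-plan"      # assumption
OUT_FILE = "/opt/lun1/data-unzip/228-pur-plan.jsonl"   # assumption

with open(OUT_FILE, "w") as out:
    for name in glob.glob(XML_DIR + "/*.xml"):
        with open(name, "rb") as f:
            doc = xmltodict.parse(f.read())
        out.write(json.dumps(doc) + "\n")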


OK. I converted my XML to JSON and tested Logstash parsing. My config:

input{
        unix {
                path => "/tmp/socket1"
                mode => "server"
                type => "228-pur-plan-soc"
                codec => "json_lines"
        }
        unix {
                path => "/tmp/socket2"
                mode => "server"
                type => "228-pur-plan-soc"
                codec => "json_lines"
        }
        unix {
                path => "/tmp/socket3"
                mode => "server"
                type => "228-pur-plan-soc"
                codec => "json_lines"
        }
        tcp {
                host => "0.0.0.0"
                port => "7101"
                type => "228-pur-plan-tcp"
                codec => "json_lines"
        }
        tcp {
                host => "0.0.0.0"
                port => "7102"
                type => "228-pur-plan-tcp"
                codec => "json_lines"
        }
        tcp {
                host => "0.0.0.0"
                port => "7103"
                type => "228-pur-plan-tcp"
                codec => "json_lines"
        }
}

filter {
   if [type] =~ "228-pur-plan-soc" {
        json { source => "message"  }
        metrics {
                meter => "soc_events"
                add_tag => "metrics"
        }
   }

   if [type] =~ "228-pur-plan-tcp" {
        json { source => "message"  }
        metrics {
                meter => "tcp_events"
                add_tag => "metrics"
        }
   }

}

output {
         if [type] =~ "228-pur-plan" {
                file {
                    path => "/opt/lun1/out.dat"
                }
       }
        if "metrics" in [tags] {
                elasticsearch {
                        hosts => "127.0.0.1:9200"
                        index => "ls_metrics-%{+YYYY.MM}"
                }
        }
}

And again I got bad results. How can I get more speed?

Have you tried setting up Beats?
It is designed precisely for reading files.

English please.

Have you tried setting up Beats (Filebeat)?
It is designed specifically for reading files and sending them to ES.
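
A minimal Filebeat sketch of that idea (assuming Filebeat 5.x; the paths and multiline pattern mirror the Logstash config above, and the Logstash port is an assumption):

filebeat.prospectors:
  - input_type: log
    paths:
      - /opt/lun1/data-unzip/ftp/228-pur-plan/*.xml
    # Join all lines of one XML document into a single event,
    # the same idea as the multiline codec used earlier.
    multiline.pattern: 'xml version'
    multiline.negate: true
    multiline.match: after
    multiline.max_lines: 999999

output.logstash:
  hosts: ["127.0.0.1:5044"]   # assumes a beats input in Logstash; Filebeat can also ship straight to ES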

Do you mean that my problem is with the input?

What speed do you expect, and what is the speed of your hard drive?
Is ES on the same server where the files are read?

At up to 20 MB per file, 4-10 XML files per second works out to roughly 80-200 MB/s, which is about the sequential speed of an average HDD.
Perhaps you are using an SSD?

I expect more than 50 JSON (or XML) documents per second.

I tried the test on another server (Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz with 40 CPUs), and I got the same result: 4-10 JSON documents per second.

If you want to test whether ES is your bottleneck you could try the redis-output talking to Redis. This will use bulk updates and networking and therefore approximates the ES output and downstream.
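
For example, a sketch with hypothetical Redis host and key:

output {
  # Send events to a Redis list instead of Elasticsearch to see whether
  # throughput improves once ES is out of the picture.
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "logstash-228-pur-plan"
  }
}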

OK, I tested it. You can view my results in the topic "Failed to send event to Redis". :slight_smile: