111126
(Михаил)
January 29, 2017, 9:24pm
1
Hello.
I have 1 million XML files. Each file ranges from a few kilobytes to 20 MB.
I need to parse them and put them into Elasticsearch. I tried the following config:
input {
  file {
    path => "/opt/lun1/data-unzip/ftp/228-pur-plan/*.xml"
    exclude => "*.zip"
    type => "228-pur-plan"
    start_position => "beginning"
    max_open_files => "64"
    close_older => "0.1"
    codec => multiline {
      pattern => "xml version"
      negate => true
      what => "previous"
      max_lines => "999999"
      max_bytes => "20 MiB"
    }
  }
}
filter {
  if [type] =~ "228-pur-plan" {
    xml {
      store_xml => false
      source => "message"
      xpath => [
        <about_100_lines>
      ]
    }
  }
}
output {
  if [type] =~ "228-pur-plan" {
    elasticsearch {
      hosts => "127.0.0.1:9200"
      index => "228-pur-plan-%{+YYYY.MM}"
      template => "/opt/logstash/custom_patterns/elasticsearch-template_228.json"
      template_name => "228"
    }
  }
}
My server runs CentOS 6.8 and has:
80 GB of memory
Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
with 16 CPUs.
I set the Logstash heap size to 32 GB.
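(For a Logstash 5.x package install the heap is usually set in config/jvm.options; these lines are illustrative of the 32 GB setting above:)
# config/jvm.options
-Xms32g
-Xmx32g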
But it works very slowly! I get a very bad result: 4 files parsed per second.
How can I make this faster? Maybe I need to do some tuning? Or maybe I should use a different input plugin?
warkolm
(Mark Walkom)
January 30, 2017, 5:03am
2
Try increasing the pipeline worker count.
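For example, on a Logstash 5.x install the worker count is set in logstash.yml or on the command line (the values below are illustrative, sized to the 16 CPUs mentioned above, and the config path is hypothetical):
# logstash.yml
pipeline.workers: 16
pipeline.batch.size: 250
# or equivalently:
# bin/logstash -w 16 -b 250 -f /etc/logstash/conf.d/228-pur-plan.conf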
Also, is it LS or ES that is the bottleneck?
How can I check it?
What's the limiting factor? How many files per second can Logstash process alone? How many documents per second can Elasticsearch accept? Measure.
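One quick way to measure Logstash alone (a suggested test, not something from this thread) is to swap the elasticsearch output for a stdout output with the dots codec, which prints one character per event, and pipe it through pv; the byte rate then equals the event rate:
output { stdout { codec => dots } }
# bin/logstash -f test.conf | pv -abt > /dev/null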
I think a major issue is that you're using a single file input which is going to run in a single thread. Splitting it into multiple file inputs is likely to help, but if you really want to improve the performance I suspect writing a program that bulk-reads XML files and feeds them to Logstash via e.g. TCP would perform much better.
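A minimal sketch of such a feeder in Python (host, port, and path are assumptions taken from the configs in this thread):
import glob
import socket

# Slurp each XML file whole, flatten it to a single line, and send it
# to a Logstash tcp input as one event per line.
sock = socket.create_connection(("127.0.0.1", 7101))
for path in sorted(glob.glob("/opt/lun1/data-unzip/ftp/228-pur-plan/*.xml")):
    with open(path, "rb") as f:
        body = f.read().replace(b"\r", b" ").replace(b"\n", b" ")
    sock.sendall(body + b"\n")  # the trailing newline terminates the event
sock.close()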
111126
(Михаил)
January 30, 2017, 7:03am
5
OK. I tried the tcp input too, but I got this error:
[FATAL][logstash.runner ] An unexpected error occurred! {:error=>#<Nokogiri::XML::XPath::SyntaxError: /ns2:purchasePlan/*[local-name()='header']/*[local-name()='guid']/text()>, :backtrace=>["nokogiri/XmlXpathContext.java:169:in `evaluate'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:165:in `xpath'", "org/jruby/RubyArray.java:2414:in `map'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:156:in `xpath'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:153:in `filter'", "org/jruby/RubyHash.java:1342:in `each'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:152:in `filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:145:in `do_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:164:in `multi_filter'", "org/jruby/RubyArray.java:1613:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:161:in `multi_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:41:in `multi_filter'", "(eval):155:in `initialize'", "org/jruby/RubyArray.java:1613:in `each'", "(eval):152:in `initialize'", "org/jruby/RubyProc.java:281:in `call'", "(eval):127:in `filter_func'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:295:in `filter_batch'", "org/jruby/RubyProc.java:281:in `call'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:192:in `each'", "org/jruby/RubyHash.java:1342:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:191:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:294:in `filter_batch'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:282:in `worker_loop'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:258:in `start_workers'"]}
My tcp config:
input {
  tcp {
    host => "0.0.0.0"
    port => 7101
    type => "228-pur-plan"
    codec => multiline {
      pattern => "xml version"
      negate => true
      what => "previous"
      max_lines => "999999"
      max_bytes => "20 MiB"
    }
  }
}
Always post logs as preformatted text. As it stands the interesting error message has been mangled and isn't visible.
Don't use the multiline codec. Whatever feeds Logstash via TCP should slurp the whole file in one go, removing newline characters if necessary.
111126
(Михаил)
January 30, 2017, 8:47am
7
I tried it without the multiline codec:
input {
  tcp {
    host => "0.0.0.0"
    port => "7101"
    type => "228-pur-plan"
  }
}
filter {
  if [type] =~ "228-pur-plan" {
    xml {
      store_xml => false
      source => "message"
      xpath => [
        about 100 lines
      ]
    }
  }
}
output {
  if [type] =~ "228-pur-plan" {
    elasticsearch {
      hosts => "127.0.0.1:9200"
      index => "228-pur-plan-%{+YYYY.MM}"
      template => "/opt/logstash/custom_patterns/elasticsearch-template_228.json"
      template_name => "228"
    }
  }
}
And I get the following error message:
[2017-01-30T11:41:03,942][ERROR][logstash.pipeline ] Exception in pipelineworker, the pipeline stopped processing new events, please check your filter configuration and restart Logstash. {"exception"=>#<Nokogiri::XML::XPath::SyntaxError: /ns2:purchasePlan/*[local-name()='header']/*[local-name()='guid']/text()>, "backtrace"=>["nokogiri/XmlXpathContext.java:169:in `evaluate'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:165:in `xpath'", "org/jruby/RubyArray.java:2414:in `map'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.6.8.1-java/lib/nokogiri/xml/searchable.rb:156:in `xpath'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:153:in `filter'", "org/jruby/RubyHash.java:1342:in `each'", "/usr/share/logstash/vendor/bundle/jruby/1.9/gems/logstash-filter-xml-4.0.2/lib/logstash/filters/xml.rb:152:in `filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:145:in `do_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:164:in `multi_filter'", "org/jruby/RubyArray.java:1613:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/filters/base.rb:161:in `multi_filter'", "/usr/share/logstash/logstash-core/lib/logstash/filter_delegator.rb:41:in `multi_filter'", "(eval):155:in `initialize'", "org/jruby/RubyArray.java:1613:in `each'", "(eval):152:in `initialize'", "org/jruby/RubyProc.java:281:in `call'", "(eval):127:in `filter_func'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:295:in `filter_batch'", "org/jruby/RubyProc.java:281:in `call'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:192:in `each'", "org/jruby/RubyHash.java:1342:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/util/wrapped_synchronous_queue.rb:191:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:294:in `filter_batch'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:282:in `worker_loop'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:258:in `start_workers'"]}
Okay, so the xml filter thinks it gets garbage input. I suggest you remove newline characters from the input so that what you send to Logstash contains one file per line, but that's just a guess at what's wrong. Look at the event the xml filter is complaining about.
111126
(Михаил)
January 30, 2017, 1:16pm
9
I did that. I removed the newline characters and forwarded the files to Logstash over TCP using 3 inputs:
input {
  tcp {
    host => "0.0.0.0"
    port => "7101"
    type => "228-pur-plan"
  }
  tcp {
    host => "0.0.0.0"
    port => "7102"
    type => "228-pur-plan"
  }
  tcp {
    host => "0.0.0.0"
    port => "7103"
    type => "228-pur-plan"
  }
}
But I get a bad result: 4-10 XML files per second.
I tried outputting to a file (instead of Elasticsearch), but I still get 4-10 XML files per second.
Where is my bottleneck? I think it is Logstash, but where?
111126
(Михаил)
January 30, 2017, 2:09pm
10
I tried removing the xpath XML parsing, and I got a good result: over 100 XML files per second.
Parsing is the bottleneck. How can I optimize the xpath parsing, or increase the amount of resources allocated to parsing?
guyboertje
(Guy Boertje)
January 30, 2017, 3:28pm
11
The XML support in Logstash is provided by an external library, which is fine for small use cases but very slow for xpath and other complex use cases.
You could consider an XML/XSLT to JSON preprocessor before Logstash. Example: https://github.com/bramstein/xsltjson
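Alternatively, a small preprocessor can be written directly; the sketch below uses Python with the xmltodict library (my substitution for illustration, not the XSLT tool linked above) to emit one JSON document per line:
import glob
import json

import xmltodict  # pip install xmltodict

# Convert each XML file to a nested dict and print it as one JSON line,
# ready for a Logstash input with the json_lines codec.
for path in sorted(glob.glob("/opt/lun1/data-unzip/ftp/228-pur-plan/*.xml")):
    with open(path, "rb") as f:
        doc = xmltodict.parse(f.read())
    print(json.dumps(doc))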
111126
(Михаил)
February 1, 2017, 6:42am
12
OK. I converted my XML to JSON and tested Logstash parsing. My config:
input {
  unix {
    path => "/tmp/socket1"
    mode => "server"
    type => "228-pur-plan-soc"
    codec => "json_lines"
  }
  unix {
    path => "/tmp/socket2"
    mode => "server"
    type => "228-pur-plan-soc"
    codec => "json_lines"
  }
  unix {
    path => "/tmp/socket3"
    mode => "server"
    type => "228-pur-plan-soc"
    codec => "json_lines"
  }
  tcp {
    host => "0.0.0.0"
    port => "7101"
    type => "228-pur-plan-tcp"
    codec => "json_lines"
  }
  tcp {
    host => "0.0.0.0"
    port => "7102"
    type => "228-pur-plan-tcp"
    codec => "json_lines"
  }
  tcp {
    host => "0.0.0.0"
    port => "7103"
    type => "228-pur-plan-tcp"
    codec => "json_lines"
  }
}
filter {
  if [type] =~ "228-pur-plan-soc" {
    json { source => "message" }
    metrics {
      meter => "soc_events"
      add_tag => "metrics"
    }
  }
  if [type] =~ "228-pur-plan-tcp" {
    json { source => "message" }
    metrics {
      meter => "tcp_events"
      add_tag => "metrics"
    }
  }
}
output {
  if [type] =~ "228-pur-plan" {
    file {
      path => "/opt/lun1/out.dat"
    }
  }
  if "metrics" in [tags] {
    elasticsearch {
      hosts => "127.0.0.1:9200"
      index => "ls_metrics-%{+YYYY.MM}"
    }
  }
}
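(A minimal sender sketch for one of the unix inputs above, with a hypothetical payload, assuming Python on the producing side:)
import json
import socket

# mode => "server" means Logstash listens on the socket; this side connects
# and writes newline-delimited JSON for the json_lines codec.
sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
sock.connect("/tmp/socket1")
event = {"guid": "0000-example", "planYear": 2017}  # illustrative fields
sock.sendall((json.dumps(event) + "\n").encode("utf-8"))
sock.close()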
And again I got bad results. How can I get more speed?
Hi_pay
(Hi Pay)
February 1, 2017, 10:14am
13
Have you tried setting up Beats?
It is designed precisely for reading files.
Hi_pay
(Hi Pay)
February 1, 2017, 10:23am
15
Have you tried setting up Beats (Filebeat)?
It is designed just for reading files and sending them to ES.
111126
(Михаил)
February 1, 2017, 11:45am
16
Do you mean that my problem is with the input?
Hi_pay
(Hi Pay)
February 1, 2017, 12:20pm
17
What speed do you expect, and what is the speed of your hard drive?
Is ES on the same server where the files are read?
4-10 XML files per second at up to 20 MB each is roughly 80-200 MB/s, which is about average HDD speed.
Or perhaps you are using an SSD?
111126
(Михаил)
February 1, 2017, 12:29pm
18
I expect more than 50 JSON (or XML) files per second.
I tried the test on another server (Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz with 40 CPUs), and I got the same result: 4-10 JSON files per second.
guyboertje
(Guy Boertje)
February 2, 2017, 5:48pm
19
If you want to test whether ES is your bottleneck, you could try the redis output talking to Redis. It uses bulk updates and networking, and therefore approximates the ES output and what lies downstream.
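A minimal sketch of such a test output (the host and key are illustrative, assuming a local Redis):
output {
  redis {
    host => "127.0.0.1"
    data_type => "list"
    key => "ls-throughput-test"
    batch => true   # push events in bulk, similar to the ES bulk API
  }
}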
111126
(Михаил)
February 2, 2017, 8:56pm
20
OK, I tested it. You can view my results in the topic Failed to send event to Redis.