Parsing a multiline XML file

Hello guys,

I'm trying to parse XML files from S3. I get the file just fine, but I want each XML file to be processed as a whole, like:

  <?xml version="1.0"?>
  <cdr core-uuid="ba3625bb-e210-4a49-971f-efe02a7dafc4" switchname="myswitch">
    <channel_data>
      <state>CS_REPORTING</state>
      <direction>inbound</direction>
      <state_number>11</state_number>
      <flags>0=1;1=1;3=1;35=1;37=1;38=1;40=1;43=1;53=1;75=1;108=1;109=1;110=1;111=1;112=1;113=1;122=1</flags>
      <caps>1=1;2=1;3=1;4=1;5=1;6=1</caps>
    </channel_data>
  ...
  </cdr>

but for some reason it doesn't seem to be working properly; it looks like it's parsing line by line (output below), and I'm not sure how to go about this:

  {  "channel_data.state": "CS_REPORTING",
  {  "channel_data.state": "CS_REPORTING",  "channel_data.direction": "inbound",
  {  "channel_data.state": "CS_REPORTING",  "channel_data.direction": "inbound",  
"channel_data.state_number": 11,
  {  "channel_data.state": "CS_REPORTING",  "channel_data.direction": "inbound",  "channel_data.state_number": 11,  "channel_data.flags": "0=1;1=1;3=1;35=1;37=1;38=1;40=1;43=1;53=1;75=1;106=1;112=1;113=1;122=1",
  {  "channel_data.state": "CS_REPORTING",  "channel_data.direction": "inbound",  "channel_data.state_number": 11,  "channel_data.flags": "0=1;1=1;3=1;35=1;37=1;38=1;40=1;43=1;53=1;75=1;106=1;112=1;113=1;122=1",  "channel_data.caps": "1=1;2=1;3=1;4=1;5=1;6=1",

and I want each value inserted as a separate field, with every field set to its proper type (string/int/date/etc.) - there's a rough sketch of what I mean after my config below.

My config is:

input {
	s3 {
		access_key_id => "MY-ACCESSKEY-ID"
		secret_access_key => "MY-ACCESS-KEY"
		region => "us-east-1"
		bucket => "mybucket"
		tags => [ "elasticsearch", "logstash", "kibana" ]
		type => "elb"
		prefix => "2017-12-10/"
		sincedb_path => "/var/lib/logstash/.sincedb_argo_elb"
		temporary_directory => "/tmp/logstash/input_argo_elb"
		codec => multiline {
			pattern => "<cdr>|</cdr>"
			negate => true
			what => "next"
			auto_flush_interval => 0
			max_lines => 1000
		}
	}
}

filter {
	ruby {
		code => 'require "nokogiri"
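# helper: true when the text parses as a number (Float raises otherwise, rescued to false)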
def is_number? string
  true if Float(string) rescue false
end

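# append the raw multiline message to a local debug log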
File.open("/tmp/mydebug.log","a") { |f| f.puts event.get("message") }
doc = Nokogiri::XML.parse(event.get("message"))
leaves = doc.xpath("//cdr")
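# hand-build a JSON object string, one "parent.child" key per leaf element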
myglobal = "{"
leaves.each do |node|
  comma = ""
  last_index = doc.xpath("//*[not(*)][text()]").size.pred
  puts doc.xpath("//*[not(*)][text()]").map.with_index{ |n,i|
  if n.name != "DP_MATCH"

    if is_number?(n.text)
      myglobal = myglobal + "  \"#{n.parent.name}.#{n.name}\": #{n.text}"
    else
      myglobal = myglobal + "  \"#{n.parent.name}.#{n.name}\": \"#{n.text}\""
    end

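    # after the last leaf, close the JSON string and overwrite the message field with it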
    if i == last_index
      myglobal = myglobal + "}"
      event.set("message", myglobal)
      File.open("/tmp/mydebug.log","a") { |f| f.puts myglobal }
    else
      myglobal = myglobal +  ","
    end

  end
}
end
myglobal = ""
		'
	}
}

output {
        elasticsearch { hosts => ["cdr-elastic:9200"] }
        stdout { codec => rubydebug }
}
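
In case it helps show what I'm after, here is a rough sketch of the direction I'd like the filter to go: setting fields on the event directly instead of rebuilding the message string. This is just something I typed out to illustrate the idea, not a working config:

filter {
	ruby {
		code => '
			require "nokogiri"

			# sketch only: one event field per leaf element, named parent.child,
			# with numeric-looking values converted so the type comes out right
			doc = Nokogiri::XML.parse(event.get("message"))
			doc.xpath("//*[not(*)][text()]").each do |n|
				next if n.name == "DP_MATCH"
				value = n.text
				value = Integer(value) rescue value
				event.set("#{n.parent.name}.#{n.name}", value)
			end
		'
	}
}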

Help is greatly appreciated!

David

Sorry to be the bearer of bad news - it's not really possible to do multiline like this with the S3 input.

Autoflush is not supported with the S3 input.

Thanks for the reply!

Sorry to hear that...

But what about this: if I get the files via the AWS CLI and store them locally, and Logstash then reads them file by file, would that work?

Your suggestion has a better chance of working.

Multiline XML and JSON one-document-per-file sources are really problematic for Logstash, but only if the file does not have a final newline after the "closing" characters. The clue is in the name: multiline. The multiline codec will think it has only received part of a line (in this transmission) when it sees </cdr> without a newline. Autoflush was designed with exactly this scenario in mind. Rule of thumb: if you have pretty-printed XML or JSON in a source, then there should be a final newline as the last character, or you will need autoflush (which only works on the file input, though).

Some inputs read a source in chunks of 8K or 16K, and the last character in a chunk is almost always from the middle of a line, so the line-oriented codecs do not process the partial line until they receive the rest of it in the next chunk. But (because of file tailing) the codec never knows if or when the next chunk will arrive, so without autoflush it waits indefinitely.
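
For what it's worth, with locally downloaded files the idea would look roughly like this - an untested sketch, where the path, sincedb location and flush interval are just placeholders:

input {
	file {
		path => "/data/cdr-xml/**/*.xml"
		start_position => "beginning"
		sincedb_path => "/var/lib/logstash/.sincedb_cdr_local"
		codec => multiline {
			pattern => "</cdr>"
			negate => true
			what => "next"
			# auto_flush_interval is honoured by the file input, so a file
			# without a trailing newline after </cdr> still gets flushed
			auto_flush_interval => 2
			max_lines => 1000
		}
	}
}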
