Parsing a multiline XML file

Hello guys,

I'm trying to parse XML files from S3. I get the file just fine, but I want each XML file to be processed as a whole, like:

  <?xml version="1.0"?>
  <cdr core-uuid="ba3625bb-e210-4a49-971f-efe02a7dafc4" switchname="myswitch">
    <channel_data>
      <state>CS_REPORTING</state>
      <direction>inbound</direction>
      <state_number>11</state_number>
      <flags>0=1;1=1;3=1;35=1;37=1;38=1;40=1;43=1;53=1;75=1;108=1;109=1;110=1;111=1;112=1;113=1;122=1</flags>
      <caps>1=1;2=1;3=1;4=1;5=1;6=1</caps>
    </channel_data>
  ...
  </cdr>

but for some reason it doesn't seem to be working properly; it looks like it's parsing line by line (output below), and I'm not sure how to go about this:

  {  "channel_data.state": "CS_REPORTING",
  {  "channel_data.state": "CS_REPORTING",  "channel_data.direction": "inbound",
  {  "channel_data.state": "CS_REPORTING",  "channel_data.direction": "inbound",  
"channel_data.state_number": 11,
  {  "channel_data.state": "CS_REPORTING",  "channel_data.direction": "inbound",  "channel_data.state_number": 11,  "channel_data.flags": "0=1;1=1;3=1;35=1;37=1;38=1;40=1;43=1;53=1;75=1;106=1;112=1;113=1;122=1",
  {  "channel_data.state": "CS_REPORTING",  "channel_data.direction": "inbound",  "channel_data.state_number": 11,  "channel_data.flags": "0=1;1=1;3=1;35=1;37=1;38=1;40=1;43=1;53=1;75=1;106=1;112=1;113=1;122=1",  "channel_data.caps": "1=1;2=1;3=1;4=1;5=1;6=1",

and I want each value inserted as a separate field, with every field set to its proper type (string/int/date/etc.) - there's a rough sketch of what I mean after my config below.

My config is:

input {
	s3 {
		access_key_id => "MY-ACCESSKEY-ID"
		secret_access_key => "MY-ACCESS-KEY"
		region => "us-east-1"
		bucket => "mybucket"
		tags => [ "elasticsearch", "logstash", "kibana" ]
		type => "elb"
		prefix => "2017-12-10/"
		sincedb_path => "/var/lib/logstash/.sincedb_argo_elb"
		temporary_directory => "/tmp/logstash/input_argo_elb"
		codec => multiline {
			pattern => "<cdr>|</cdr>"
			negate => true
			what => "next"
			auto_flush_interval => 0
			max_lines => 1000
		}
	}
}

filter {
	ruby {
		code => 'require "nokogiri"
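# helper: true when the text parses as a number (Float raises otherwise, rescued to false)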
def is_number? string
  true if Float(string) rescue false
end

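# append the raw multiline message to a local debug log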
File.open("/tmp/mydebug.log","a") { |f| f.puts event.get("message") }
doc = Nokogiri::XML.parse(event.get("message"))
leaves = doc.xpath("//cdr")
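# hand-build a JSON object string, one "parent.child" key per leaf element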
myglobal = "{"
leaves.each do |node|
  comma = ""
  last_index = doc.xpath("//*[not(*)][text()]").size.pred
  puts doc.xpath("//*[not(*)][text()]").map.with_index{ |n,i|
  if n.name != "DP_MATCH"

    if is_number?(n.text)
      myglobal = myglobal + "  \"#{n.parent.name}.#{n.name}\": #{n.text}"
    else
      myglobal = myglobal + "  \"#{n.parent.name}.#{n.name}\": \"#{n.text}\""
    end

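    # after the last leaf, close the JSON string and overwrite the message field with it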
    if i == last_index
      myglobal = myglobal + "}"
      event.set("message", myglobal)
      File.open("/tmp/mydebug.log","a") { |f| f.puts myglobal }
    else
      myglobal = myglobal +  ","
    end

  end
}
end
myglobal = ""
		'
	}
}

output {
        elasticsearch { hosts => ["cdr-elastic:9200"] }
        stdout { codec => rubydebug }
}
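
In case it helps show what I'm after, here is a rough sketch of the direction I'd like the filter to go: setting fields on the event directly instead of rebuilding the message string. This is just something I typed out to illustrate the idea, not a working config:

filter {
	ruby {
		code => '
			require "nokogiri"

			# sketch only: one event field per leaf element, named parent.child,
			# with numeric-looking values converted so the type comes out right
			doc = Nokogiri::XML.parse(event.get("message"))
			doc.xpath("//*[not(*)][text()]").each do |n|
				next if n.name == "DP_MATCH"
				value = n.text
				value = Integer(value) rescue value
				event.set("#{n.parent.name}.#{n.name}", value)
			end
		'
	}
}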

Help is greatly appreciated!

David

Sorry to be the bearer of bad news - it's not really possible to do multiline like this with the S3 input.

Autoflush is not supported with the S3 input.

Thanks for the reply!

Sorry to hear that...

But what about this: if I get the files via the AWS CLI and store them locally, and Logstash then reads them file by file, would that work?

Your suggestion has a better chance of working.

Multiline XML and JSON one-document-per-file sources are really problematic for Logstash, but only if the file does not have a final newline after the "closing" characters. The clue is in the name: multiline. The multiline codec will think it has only received part of a line (in this transmission) when it sees </cdr> without a newline. Autoflush was designed with exactly this scenario in mind. Rule of thumb: if you have pretty-printed XML or JSON in a source, then there should be a final newline as the last character, or you will need autoflush (which only works on the file input, though).

Some inputs read a source in chunks of 8K or 16K, and the last character in a chunk is almost always from the middle of a line, so the line-oriented codecs do not process the partial line until they receive the rest of it in the next chunk. But (because of file tailing) the codec never knows if or when the next chunk will arrive, so without autoflush it waits indefinitely.
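
For what it's worth, with locally downloaded files the idea would look roughly like this - an untested sketch, where the path, sincedb location and flush interval are just placeholders:

input {
	file {
		path => "/data/cdr-xml/**/*.xml"
		start_position => "beginning"
		sincedb_path => "/var/lib/logstash/.sincedb_cdr_local"
		codec => multiline {
			pattern => "</cdr>"
			negate => true
			what => "next"
			# auto_flush_interval is honoured by the file input, so a file
			# without a trailing newline after </cdr> still gets flushed
			auto_flush_interval => 2
			max_lines => 1000
		}
	}
}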
