XML filter array

Hello, I have an XML file which contains 10k+ elements. I played around with the XML filter for Logstash, which only gives me arrays like this:

"date" => [
[ 0] "Jun 21, 2013 1:48:43 PM",
[ 1] "Apr 22, 2013 12:16:00 PM",
[ 2] "Nov 7, 2012 5:11:03 PM",
[ 3] "Jun 21, 2013 1:45:02 PM",
[ 4] "Nov 7, 2012 4:40:02 PM",
[ 5] "Nov 7, 2012 5:05:08 PM",
[ 6] "Jun 21, 2013 3:29:40 PM",
[ 7] "Jan 9, 2013 2:02:57 PM",
[ 8] "Oct 26, 2012 12:27:21 PM",
[ 9] "Jan 4, 2013 11:26:43 AM",
[10] "Jan 9, 2013 12:16:03 PM",
[11] "Jan 8, 2013 12:01:31 PM",
[12] "Jan 7, 2013 2:53:45 PM",
[13] "Jan 4, 2013 11:30:23 AM",
[14] "Nov 7, 2012 10:37:08 AM",
[15] "Jan 9, 2013 1:57:47 PM",
[16] "Jan 4, 2013 11:33:38 AM",
[17] "Jan 9, 2013 1:57:49 PM",
[18] "Jan 4, 2013 12:05:24 PM",
[19] "Jan 9, 2013 6:17:33 PM",
[20] "Jan 9, 2013 1:57:53 PM",
[21] "Jan 8, 2013 7:52:05 AM",
[22] "Jun 24, 2013 4:31:18 PM"
],

Instead of five arrays containing the values, I'd like the output to be split into multiple documents, each with normal fields and values.

My pipeline looks like this:

filter {
  xml {
    source => "message"
    store_xml => "false"
    force_array => "false"
    force_content => "true"
    xpath => ["/memberrevisions/memberrevision/ID/text()", "ID"]
    xpath => ["/memberrevisions/memberrevision/author/text()", "author"]
    xpath => ["/memberrevisions/memberrevision/desc/text()", "desc"]
    xpath => ["/memberrevisions/memberrevision/date/text()", "date"]
  }
}

You can use a ruby filter to loop over the elements of each array and join them into an array of objects (each one with an id, author, desc, and date), then use a split filter to turn them into separate events.

Thanks for the answer.

How much will that impact performance?

And I probably need more than a little help with the Ruby code.

ruby {  code => "
          event['date'].each do |x|
              x.each do |key, value|
                  x[key] = value[x]
              end
          end"
      }

Does that make sense at all?

How much will that impact performance?

Like all filters, the performance will be affected, but I have no idea by how much. You'll have to measure.

And I probably need more than a little help with the Ruby code.

Look into using transpose(). Form an array out of your arrays and it'll join them for you. Example:

$ irb
irb(main):001:0> [['id1', 'id2', 'id3'], ['author1', 'author2', 'author3'], ['desc1', 'desc2', 'desc3']].transpose
=> [["id1", "author1", "desc1"], ["id2", "author2", "desc2"], ["id3", "author3", "desc3"]]

Perhaps something like this:

event.set('dest-field', [event.get('ID'), event.get('author'), event.get('desc'), event.get('date')].transpose)

Then, loop over the resulting array and use collect to transform each item from a four-element array to a hash:

array_of_hashes = array_of_arrays.collect { |i|
  {'id' => i[0], 'author' => i[1], 'desc' => i[2], 'date' => i[3]}
}
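Putting transpose and collect together, a minimal plain-Ruby sketch (outside Logstash, with made-up sample values) would look like this:

```ruby
# Sample parallel arrays, standing in for what the xpath extraction produces.
ids     = ['id1', 'id2']
authors = ['author1', 'author2']
descs   = ['desc1', 'desc2']
dates   = ['Jan 4, 2013 11:26:43 AM', 'Jan 9, 2013 12:16:03 PM']

# transpose groups the nth element of every array into one sub-array...
array_of_arrays = [ids, authors, descs, dates].transpose

# ...and collect turns each four-element sub-array into a hash.
array_of_hashes = array_of_arrays.collect { |i|
  {'id' => i[0], 'author' => i[1], 'desc' => i[2], 'date' => i[3]}
}

puts array_of_hashes.inspect
```

Inside a ruby filter the input arrays would come from event.get, as shown above, rather than from local variables.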

Thanks again for the help.

I tried it and it's partially working.
I tried it without the loop, just accessing the first element of the new array, but it throws a Ruby exception:

ruby {  code => "event.set('mksrevision', [event.get('ID'), event.get('author'), event.get('desc'), event.get('date')].transpose)"}
ruby {  code => "array_of_hashes = mksrevision.collect { |i| {'id' => i[0], 'author' => i[1], 'desc' =>i1[2], 'date' => i[3]}}"}

What am I missing?

You're mixing apples and oranges. event.set and event.get operate on event fields, but mksrevision.collect attempts to access a local variable named mksrevision, and there is no such thing.

Also, there's no reason to use multiple ruby filters.

ruby {
  code => "
    line1
    line2
  "
}

I think I got it, kinda.

ruby {
  code => "
    event.set('mksrevision', [event.get('ID'), event.get('author'), event.get('desc'), event.get('date')].transpose)
    array_of_hashes = event.get('mksrevision').collect { |i| {'id' => i[0], 'author' => i[1], 'desc' => i[2], 'date' => i[3]}}
  "
}

mutate {
  remove_field => ["message", "author", "date", "ID", "desc", "mksrevision"]
}

No more Ruby exception, but the stdout rubydebug output shows no array_of_hashes, and I still have to split the array_of_hashes.

Again, array_of_hashes is just a local Ruby variable. If you need to process it with other filters you should save it to a field with event.set. mksrevision, on the other hand, does not need to be a field since you don't need that value outside the ruby filter.

I'm getting there step by step.

ruby {
  code => "
    event.set('mksrevision', [event.get('ID'), event.get('author'), event.get('desc'), event.get('date')].transpose)
    array_of_hashes = event.get('mksrevision').collect { |i| {'id' => i[0], 'author' => i[1], 'desc' => i[2], 'date' => i[3]}}
    event.set('mks', array_of_hashes)
  "
}

This creates an array "mks" which basically has the right structure.

I need to send each array element as a separate document to Elasticsearch.

Edit:
Got it. After that I used

split {
  field => "mks"
}
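For what it's worth, what the split filter does here can be sketched in plain Ruby: it emits one clone of the event per element of the named array field, with that field replaced by the single element in each clone. The field names below are the ones from this thread, with sample values:

```ruby
# One event whose 'mks' field holds the array of hashes built above.
event = {
  'mks' => [
    {'id' => '1', 'author' => 'author1'},
    {'id' => '2', 'author' => 'author2'}
  ]
}

# Conceptually, split clones the event once per array element,
# replacing 'mks' with that element in each clone.
events = event['mks'].map { |item| event.merge('mks' => item) }

events.each { |e| puts e.inspect }
```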

Now it's working. Thanks Magnus for bearing with me and my nonexistent Ruby knowledge.

Yesterday I did not see that the split created a new field which contains the content.

How can I remove the top level?

   "mks" => {
  "date" => "Jun 24, 2013 4:31:18 PM",
"author" => "WallE",
    "id" => "t:/mks/"
  "desc" => "Initial"
    },

I think you have to move (rename) the fields to the top level with a mutate filter.

Since renaming had a pretty big performance impact on a previous pipeline, I didn't want to use mutate rename again.

So I came up with a Ruby solution instead, which runs after the split.
Another benefit: it works on fields without a known name.

ruby {
  code => "
    event.get('mks').each {|k, v|
      event.set(k, v)
    }
    event.remove('message')
  "
}
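Outside of Logstash, the same hoisting step can be sketched on a plain hash (the 'mks' contents here are sample values):

```ruby
# After the split, each event carries its data nested under 'mks'.
event = {
  'message' => '<memberrevisions>...</memberrevisions>',
  'mks' => {'id' => 't:/mks/', 'author' => 'WallE', 'desc' => 'Initial'}
}

# Copy every key/value pair from the nested hash to the top level,
# then drop the raw message. The key names don't need to be known
# in advance, which is the benefit over mutate rename.
event['mks'].each { |k, v| event[k] = v }
event.delete('message')

puts event.inspect
```

The leftover 'mks' field can then be dropped with a mutate remove_field, as in the full config below.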

Thanks again Magnus for all the help.

Hi @Lukas_Tilch

I am using your example, and I am able to get the mks value after transposing it.

Like the below. But I am not able to split 'mks'.

After using split on 'mks' I am not getting the mks field itself.

Can you please share your conf file? That would be very helpful. Thanks.

Also, I am getting only one document. How can I split into multiple events to get multiple documents?

I am using the split filter but it is not working, even though I have set the field to mks.

{
	"id" : "1",
	"author" : "author1",
	"desc" : "Description for 1",
	"date" : "Jun 18, 2017 1:48:43 PM"
}, {
	"id" : "2",
	"author" : "author2",
	"desc" : "Description for 2",
	"date" : "Jun 21, 2017 1:48:43 PM"
}

Hey, the way XML files are handled by Logstash didn't really fit my needs, so I converted the XML document to JSON with PowerShell.
Nonetheless, I can give you my conf.
filter {

  mutate {
    gsub => ["message", "\n", ""]
  }

  xml {
    source => "message"
    store_xml => "false"
    force_array => "false"
    #force_content => "true"
    xpath => ["/memberrevisions/memberrevision/ID/text()", "ID"]
    xpath => ["/memberrevisions/memberrevision/author/text()", "author"]
    xpath => ["/memberrevisions/memberrevision/desc/text()", "desc"]
    xpath => ["/memberrevisions/memberrevision/date/text()", "date"]
    xpath => ["/memberrevisions/memberrevision/revision/text()", "rev"]
    xpath => ["/memberrevisions/memberrevision/name/text()", "name"]
    xpath => ["/memberrevisions/memberrevision/path/text()", "pathe"]
  }

  ruby {
    code => "
      event.set('mksrevision', [event.get('ID'), event.get('author'), event.get('desc'), event.get('date'), event.get('rev'), event.get('name'), event.get('pathe')].transpose)
      array_of_hashes = event.get('mksrevision').collect { |i| {'ID' => i[0], 'author' => i[1], 'desc' => i[2], 'date' => i[3], 'rev' => i[4], 'name' => i[5], 'pathe' => i[6]}}
      event.set('mks', array_of_hashes)
    "
  }

  split {
    field => "mks"
  }

 
#this is a different approach for differently structured xml; can be ignored
#mutate {
    #rename => { "[ID][0]" => "ID" }
    #rename => { "[author][0]" => "author" }
    #rename => { "[desc][0]" => "desc" }
    #rename => { "[date][0]" => "date" }
    #rename => { "[rev][0]" => "rev" }
    #rename => { "[name][0]" => "name" }
    #rename => { "[pathe][0]" => "pathe" }
    #}


#  replace => {
#      "id" => "%{[id][0]}"
#      "author" => "%{[author][0]}"
#      "desc" => "%{[desc][0]}"
#      "date" => "%{[date][0]}"
#      "rev" => "%{[rev][0]}"
#      "name" => "%{[name][0]}"
#      "pathe" => "%{[pathe][0]}"
#    }



  ruby {
    code => "
      event.get('mks').each {|k, v|
        event.set(k, v)
      }
      event.remove('message')
    "
  }





#    date{
#    match => ["date", "MMM d, yyyy h:mm:ss a"]
#    target => "date"
#      }

    mutate{
    remove_field => ["message", "mks", "mksrevision"]
    }
}

output {
  stdout {
    codec => rubydebug
  }
  elasticsearch {
  hosts => [ "hcd1515g:9200" ]
    action => "update"
    doc_as_upsert => "true"
    document_id => "%{ID}"
    document_type => "mks-revision"
    index => "logstash-mks-file-revisions"
  }
}

Hi

Thanks for this config file.
Still, I am not able to split 'mks'; I am getting the same issue as before.

After using split, I get only the last record. It doesn't split into multiple documents.
Please let me know if you got multiple events after using the split:

split {
  field => "mks"
}

Have you tried this?

ruby {
  code => "
    event.get('mks').each {|k, v|
      event.set(k, v)
    }
    event.remove('message')
  "
}

Maybe it has something to do with how your XML is structured?

Hi

Thanks for responding.
Yes, I have tried this. That block is only for setting the key and value, and it comes after the split.

My problem is that the split itself is not working.
After applying the split, I get only one event, which is the last event. It looks like some looping is not happening.

I am able to get the mks field successfully.

The structure of my XML is exactly the same as yours.

Can you please tell me if you were able to split into multiple documents/events after applying split?


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.