Help with parsing XML content

jjdepaul · August 15, 2015, 8:31pm

I have a log that has well-formed XML in it. I would like to parse out and extract several of the elements from that XML and publish them as fields in /sprint/slic, my index/type pair.

Here is the data (two separate lines with \r at the end of each line):

[8/14/15 18:15:18:595] [DEBUG][main] HttpClient.submit - POST : <?xml version="1.0" encoding="UTF-8" standalone="yes"?><BillingAndCosting version="1.0"><ControlArea><SenderId>COMMERCIAL-BILLING_48012</SenderId><WaterMark>1419098400000</WaterMark><RecordCount>2</RecordCount><TimeStamp>2014-12-21T07:52:00.446-06:00</TimeStamp></ControlArea></BillingAndCosting>
[8/14/15 18:16:18:595] [DEBUG][main] end of post.

Now here is the Filter section of my Config file in which I'm attempting to provide field mapping from XML to document fields:

filter {
  if [path] =~ "SLIC" {
        mutate { replace => { "type" => "slic" } }
  } else {
        mutate { replace => { "type" => "sysout" } }
  }
  grok {
    match => [
          "message",
          "^\[%{DATESTAMP:tslice}\] ... %{GREEDYDATA:xmldata}"
    ]
  }
  #if "_grokparsefailure" in [tags] {
  #        drop { }
  #}  
  xml {
      source => "xmldata"
      add_field => { 
            "senderId" => "%{SenderId}" 
            #"waterMark" => "%{WaterMark}"
            #"nbrOfAccounts" => "%{RecordCount}"
            #"eventTimeStamp" => "%{TimeStamp}"
      }
  }
}

The above config did create the xmldata field in my document, but it didn't create the individual data fields from the xml document. Here is what's in the document data now:

  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "sprint",
      "_type" : "slic",
      "_id" : "AU8zBckKlwBsRW10eWY3",
      "_score" : 1.0,
      "_source":{"message":"[8/14/15 18:15:18:595] [DEBUG][main] HttpClient.submit - POST : <?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><BillingAndCosting version=\"1.0\"><ControlArea><SenderId>COMMERCIAL-BILLING_48012</SenderId><WaterMark>1419098400000</WaterMark><RecordCount>2</RecordCount><TimeStamp>2014-12-21T07:52:00.446-06:00</TimeStamp></ControlArea></BillingAndCosting>\r","@version":"1","@timestamp":"2015-08-15T20:21:00.228Z","host":"IBM-EN189AKEUJ4","path":"C:/logstash-1.5.3/demo/SystemSLICxml.log","type":"slic","tslice":"8/14/15 18:15:18:595","xmldata":"<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?><BillingAndCosting version=\"1.0\"><ControlArea><SenderId>COMMERCIAL-BILLING_48012</SenderId><WaterMark>1419098400000</WaterMark><RecordCount>2</RecordCount><TimeStamp>2014-12-21T07:52:00.446-06:00</TimeStamp></ControlArea></BillingAndCosting>\r","tags":["_xmlparsefailure"]}
    } ]
  }
}

How come it's not parsing out and creating these 4 fields for me in the Document?
"senderId" => "%{SenderId}"
"waterMark" => "%{WaterMark}"
"nbrOfAccounts" => "%{RecordCount}"
"eventTimeStamp" => "%{TimeStamp}"

magnusbaeck · August 16, 2015, 2:55pm

You're not setting the target parameter, which is required when store_xml is true (which is the default). See github.com/logstash-plugins/logstash-filter-xml issue #9.

You might want to look into the xpath parameter as a means of extracting and renaming fields from the XML. It might be more convenient than adding/moving fields and deleting the rest.

jjdepaul · August 17, 2015, 4:51pm

OK, I've switched to xpath. Logstash is now successfully parsing out the values I need, but they are not present in my Document after a successful run... How can I add the four new fields to my doc? A short example would be perfect.

Here is my config:

filter {
  if [path] =~ "SLIC" {
        mutate { replace => { "type" => "slic" } }
  } else {
        mutate { replace => { "type" => "sysout" } }
  }
  grok {
    match => [
          "message",
          "^\[%{DATESTAMP:tslice}\] %{GREEDYDATA} HttpClient.submit - POST : %{GREEDYDATA:inxml}"
    ]
  }
  if "_grokparsefailure" in [tags] {
          drop { }
  }  
  xml {
      source => "inxml"
      target => "xmldata"
      store_xml => "false"
      xpath => ["/BillingAndCosting/ControlArea/SenderId/text()","senderid"]
      xpath => ["/BillingAndCosting/ControlArea/WaterMark/text()","watermark"]
      xpath => ["/BillingAndCosting/ControlArea/RecordCount/text()","nbrofaccounts"]
      xpath => ["/BillingAndCosting/ControlArea/TimeStamp/text()","timeofdelivery"]
  }
}

Also, is there a way to count nodes that may appear in the XML structure, like in this example below. How can that be reflected in the config file?!

  <xsl:for-each select="BillingAndCosting/DataArea/CustomerAccount">
    <p> Account <xsl:value-of select="ExternalKey" /> has <xsl:value-of select="count(BillingData)"/> invoices</p>
  </xsl:for-each>

magnusbaeck · August 18, 2015, 5:35am

How can I add the four new fields to my doc?

That should be done by the xml filter when you're setting xpath. If this doesn't happen, are you sure the xpath expression is correct? Is there anything interesting in the Logstash logs?
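One way to check the expressions outside Logstash is a few lines of plain Ruby. This is only a sketch: it uses REXML, Ruby's bundled XML library, as a stand-in for whatever the xml filter uses internally (an assumption), and the variable names are made up for illustration. The XML and the XPath expressions are copied from the posts above.

```ruby
require 'rexml/document'

# Sample payload copied from the log line in the first post.
xml = '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>' \
      '<BillingAndCosting version="1.0"><ControlArea>' \
      '<SenderId>COMMERCIAL-BILLING_48012</SenderId>' \
      '<WaterMark>1419098400000</WaterMark>' \
      '<RecordCount>2</RecordCount>' \
      '<TimeStamp>2014-12-21T07:52:00.446-06:00</TimeStamp>' \
      '</ControlArea></BillingAndCosting>'

doc = REXML::Document.new(xml)

# Same expressions and target field names as in the xml filter config.
xpaths = {
  'senderid'       => '/BillingAndCosting/ControlArea/SenderId/text()',
  'watermark'      => '/BillingAndCosting/ControlArea/WaterMark/text()',
  'nbrofaccounts'  => '/BillingAndCosting/ControlArea/RecordCount/text()',
  'timeofdelivery' => '/BillingAndCosting/ControlArea/TimeStamp/text()'
}

# XPath.match returns an array of matching nodes, even for a single hit.
fields = xpaths.transform_values do |path|
  REXML::XPath.match(doc, path).map(&:to_s)
end

fields.each { |name, values| puts "#{name}: #{values.inspect}" }
```

An empty list for any field would point at a typo in that expression. Note that each expression yields a list of matches rather than a single string.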

target => "xmldata"

This isn't needed now that you're using the xpath parameter and setting store_xml to false.

Also, is there a way to count nodes that may appear in the XML structure, like in this example below. How can that be reflected in the config file?!

If you re-enable store_xml you should be able to write a snippet of Ruby code in a ruby filter to count the fields that result from the XML nodes.

jjdepaul · August 18, 2015, 7:49pm

Ok - I've removed "target" stanza and it came alive. Thx -

I had to remove some of the extra fields that I didn't want to include (message, inxml, xmldata) - I guess the filter will put all defined variables in the DOC by default.

Is Ruby the only code option supported/allowed?! I'm familiar with Java syntax (could probably deal with Scala), but not Ruby... Would like to count the nodes in the more complex XML. Where do I start? Some code samples would be most appreciated.

magnusbaeck · August 18, 2015, 8:05pm

Is Ruby the only code option supported/allowed?!

With the stock filters, yes. Theoretically one could write plugins to support custom filters written in any language.

Would like to count the nodes in the more complex XML. Where do I start?

The event variable contains the event itself, and it behaves mostly like a Ruby hash (equivalent to a Java map). With .to_hash you can turn it into an actual hash containing fields and their values. Hashes as well as arrays are indexed with square brackets and the length of an array can be obtained with .length.

Not knowing what the XML looks like and consequently how the xml filter treats it I can't get into specifics, but if you store the XML in the xmldata field (having store_xml set to true) something similar to this should store the number of nodes in the count field:

filter {
  ruby {
    code => "
      event['count'] = event['xmldata']['BillingAndCosting'][0]['DataArea'][0]['CustomerAccount'].length
    "
  }
}

Start by getting the parsed XML into the xmldata field and dumping the result with a stdout output with codec => rubydebug. Note that my code sample above blindly assumes that the XML has the expected form. You'll probably want to surround the statement with a begin ... rescue ... end block to swallow any exception raised when trying to access the fields.
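For someone coming from Java, the indexing and the rescue part might look roughly like this in plain Ruby. It is a sketch under assumptions: the event here is an ordinary hash, its nesting is guessed from the filter snippet above (with the CustomerAccount element name from the XSLT example) rather than taken from real xml filter output, and count_accounts is a hypothetical helper name.

```ruby
# Hypothetical event: the nesting below is an assumption, not the real
# output of the xml filter. Dump a real event with rubydebug to check.
event = {
  'xmldata' => {
    'BillingAndCosting' => [
      { 'DataArea' => [
        { 'CustomerAccount' => [
          { 'ExternalKey' => ['A-1'], 'BillingData' => [{}, {}] },
          { 'ExternalKey' => ['A-2'], 'BillingData' => [{}] }
        ] }
      ] }
    ]
  }
}

# Returns the number of CustomerAccount nodes, or nil when the event does
# not have the expected structure (a missing key yields nil, and indexing
# nil raises NoMethodError, which is swallowed here).
def count_accounts(event)
  event['xmldata']['BillingAndCosting'][0]['DataArea'][0]['CustomerAccount'].length
rescue NoMethodError, TypeError
  nil
end

event['count'] = count_accounts(event)
puts event['count']

# An event without any XML, like the "end of post." line, gives nil.
puts count_accounts({ 'message' => 'end of post.' }).inspect
```

Inside a ruby filter only the body of the method would go into the code string; the method wrapper is used here just to keep the rescue tidy.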

jjdepaul · August 18, 2015, 11:51pm

Thank you for the guidance on Ruby code.

I am still having problems with the XML values that have been set in my Document by xml filter. I am getting data in the document, but Kibana does not like it - it crashes with errors when I import billing Index into it and attempt a Search. The search goes into a tailspin and never recovers. Upper bar reports errors.

Here is some of the resulting data that I see in ES. The time series is based on tslice field. The critical fields for my data visualization are senderid, watermark, and nbrofaccounts. They look like they each have been put into an array by xml filter, yet the mapping looks to be preserved just like I defined it (simple strings and integer).... not sure what the issue is. Does this document look ok?

{
      "_index" : "billing",
      "_type" : "slic",
      "_id" : "AU9DLwW9VXhZ1mHes35J",
      "_score" : 1.0,
      "_source":{"@version":"1","@timestamp":"2015-08-18T23:39:58.802Z","host":"IBM-EN189AKEUJ4","path":"C:/logstash-1.5.3/demo/SystemSLICxml_bulk.log","type":"slic","tslice":"8/14/15 18:18:18:595","senderid":["COMMERCIAL-BILLING_48017"],"watermark":["1419097910000"],"nbrofaccounts":["11"],"timeofdelivery":["2014-12-22T07:52:00.246-06:00"]}
    } ]
  }
}

The console in logstash looked like this:

{
          "@version" => "1",
        "@timestamp" => "2015-08-18T23:39:59.101Z",
              "host" => "IBM-EN189AKEUJ4",
              "path" => "C:/logstash-1.5.3/demo/SystemSLICxml_bulk.log",
              "type" => "slic",
            "tslice" => "8/18/15 18:21:18:595",
          "senderid" => [
        [0] "COMMERCIAL-BILLING_48020"
    ],
         "watermark" => [
        [0] "1419097940000"
    ],
     "nbrofaccounts" => [
        [0] "15"
    ],
    "timeofdelivery" => [
        [0] "2014-12-22T07:52:00.246-06:00"
    ]
}

magnusbaeck · August 19, 2015, 3:46am

The senderid, watermark, nbrofaccounts, and timeofdelivery fields are arrays. I don't know if that's what's upsetting Kibana but it's not what you want. You should be able to use a mutate filter to rename e.g. [senderid][0] to senderid.

jjdepaul · August 19, 2015, 3:49am

thanks for your reply. What is strange to me is that when I display mappings for that index/type, the fields in question don't appear to be defined as arrays there... I will try it with Mutate.

magnusbaeck · August 19, 2015, 5:27am

That's because mapping-wise there are no arrays. A field that's mapped as, say, a string can either be a scalar string or an array of strings.

jjdepaul · August 20, 2015, 3:09pm

I've added MUTATE instructions to my xml filter, but still, the resulting document fields end up as Arrays in ES.

Here is my config section for XML filter:

filter {
  if [path] =~ "SLIC" {
        mutate { replace => { "type" => "slic" } }
  }
  grok {
    match => [
          "message",
          "^\[%{DATESTAMP:tslice}\] %{GREEDYDATA} HttpClient.submit - POST : %{GREEDYDATA:inxml}"
    ]
  }
  if "_grokparsefailure" in [tags] {
          drop { }
  }  
  xml {
      source => "inxml"
      #target => "xmldata"
      store_xml => "false"
      xpath => ["/BillingAndCosting/ControlArea/SenderId/text()","senderid"]
      xpath => ["/BillingAndCosting/ControlArea/WaterMark/text()","watermark"]
      xpath => ["/BillingAndCosting/ControlArea/RecordCount/text()","nbrofaccounts"]
      xpath => ["/BillingAndCosting/ControlArea/TimeStamp/text()","timeofdelivery"]
  }
   if "_grokparsefailure" in [tags] {
              drop { }
    } else {
        mutate { 
            remove_field => [ "message", "inxml", "xmldata" ] 
        }
        mutate {
        convert => { "senderid" => "string" }
        convert => { "watermark" => "string" }
        convert => { "nbrofaccounts" => "integer" }
        convert => { "timeofdelivery" => "string" }
      }
  }
}

And here is what the resulting document looks like in ES. The 4 fields in question still appear as Arrays in ES - help:

  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "billing",
      "_type" : "slic",
      "_id" : "AU9LolAS9HwUDcv_8GWx",
      "_score" : 1.0,
      "_source":{"@version":"1","@timestamp":"2015-08-20T15:02:51.544Z","host":"IBM-EN189AKEUJ4","path":"C:/logstash-1.5.3/demo/SystemSLICxml.log","type":"slic","tslice":"8/14/15 18:15:18:595","senderid":["COMMERCIAL-BILLING_44886"],"watermark":["1419098400000"],"nbrofaccounts":[32],"timeofdelivery":["2014-12-21T07:52:00.446-06:00"]}
    } ]
  }

magnusbaeck · August 20, 2015, 5:37pm

The mutate filters you have do nothing to flatten the arrays, but this does:

filter {
  mutate {
    replace => {
      "senderid" => "%{[senderid][0]}"
      "watermark" => "%{[watermark][0]}"
      "nbrofaccounts" => "%{[nbrofaccounts][0]}"
      "timeofdelivery" => "%{[timeofdelivery][0]}"
    }
  }
}
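What the [0] references do is take the first element of each array. In plain Ruby terms it is roughly the following sketch, where the event is an ordinary hash with the values from the document above (an illustration, not Logstash's own code). One difference to be aware of: as far as I know a %{...} reference always produces a string, whereas this sketch keeps the integer as an integer.

```ruby
# Event fields as they appear in the document above.
event = {
  'type'           => 'slic',
  'senderid'       => ['COMMERCIAL-BILLING_44886'],
  'watermark'      => ['1419098400000'],
  'nbrofaccounts'  => [32],
  'timeofdelivery' => ['2014-12-21T07:52:00.446-06:00']
}

%w[senderid watermark nbrofaccounts timeofdelivery].each do |field|
  value = event[field]
  # Replace the array with its first element; leave scalars untouched.
  event[field] = value.first if value.is_a?(Array)
end

puts event.inspect
```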

jjdepaul · August 20, 2015, 6:17pm

That's the detail I was missing, thank you, thank you! Works now...

jjdepaul · August 21, 2015, 3:50pm

One last question in this space:

I was not sure about the location of the following IF statement:

  if "_grokparsefailure" in [tags] {
          drop { }
  }

Is this something that is bound to each Filter definition or is that something that's shared by ALL filters that I define? I have placed that IF after grok {} and then after xml {} filters I have defined, but wasn't sure I needed to do so after each one or just once....

magnusbaeck · August 21, 2015, 4:54pm

All filters are processed in order for all messages, unless surrounded by a conditional like in your case. So it's fine for you to have a single conditional drop at the end.

jjdepaul · August 21, 2015, 7:04pm

Thx very much -
