Parsing XML: managing arrays and multiline input

Hi everyone,

I'm dealing with a huge XML file and I'm trying to proceed step by step.
At the moment I'm having trouble with how Logstash handles multiline input and arrays.

This is the simplified XML I'm trying to parse:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <hostname>crt-mon</hostname>
  <date>2016.11.01</date>
  <time>01:23:04 CET</time>
  <release>11.6</release>
  <version>2.1</version>
</properties>

First things first, I tried this Logstash configuration:

input {
  file {
    path => "/srv/logstash/logs/test_multiline.xml"
    type => "test-xml"
    start_position => "beginning"
    codec => multiline {
      pattern => "^<\?properties .*\>"
      negate => "true"
      what => "previous"
    }
  }
}
filter {
  xml {
    store_xml => "false"
    source => "message"
    xpath => [
      "/properties/hostname/text()", "hostname",
      "/properties/date/text()", "date",
      "/properties/time/text()", "time",
      "/properties/release/text()", "release",
      "/properties/version/text()", "version"
    ]
  }
  mutate {
    replace => {"hostname" => "%{[hostname][0]}" }
    replace => {"date" => "%{[date][0]}" }
    replace => {"time" => "%{[time][0]}" }
    replace => {"release" => "%{[release][0]}" }
    replace => {"version" => "%{[version][0]}" }
  }
}
output { stdout { codec => rubydebug } }

Unfortunately nothing seems to happen. My guess is that Logstash is waiting for the next line, because when I stop the pipeline I can see that the event has in fact been parsed:

{:timestamp=>"2016-10-31T18:54:27.754000+0000", :message=>"Pipeline main started"}
{:timestamp=>"2016-10-31T18:54:38.473000+0000", :message=>"SIGINT received. Shutting down the agent.", :level=>:warn}
{:timestamp=>"2016-10-31T18:54:38.480000+0000", :message=>"stopping pipeline", :id=>"main"}
{
    "@timestamp" => "2016-10-31T18:54:39.073Z",
       "message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<properties>\n  <hostname>crt-mon</hostname>\n  <date>2016.11.01</date>\n  <time>01:23:04 CET</time>\n  <release>11.6</release>\n  <version>2.1</version>\n</properties>",
      "@version" => "1",
          "tags" => [
        [0] "multiline"
    ],
          "path" => "/srv/logstash/logs/test_multiline.xml",
          "host" => "4d8280939c35",
          "type" => "test-xml",
      "hostname" => "crt-mon",
          "date" => "2016.11.01",
          "time" => "01:23:04 CET",
       "release" => "11.6",
       "version" => "2.1"
}
{:timestamp=>"2016-10-31T18:54:39.826000+0000", :message=>"Pipeline main has been shutdown"}

So I manually put the whole XML on a single line and tried this input configuration:

input {
  file {
    path => "/srv/logstash/logs/test.xml"
    type => "test-xml"
    start_position => "beginning"
    ignore_older => 0
  }
}

This time it works, but I don't understand why the mutate/replace filter overwrites my fields with the literal text %{[fieldname][0]} in the second event below. All I want is to replace the array generated by the XML filter with a single value:

{:timestamp=>"2016-10-31T19:03:33.614000+0000", :message=>"Pipeline main started"}
{
       "message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?><properties><hostname>crt-mon</hostname><date>2016.11.01</date><time>01:23:04 CET</time><release>11.6</release><version>2.1</version></properties>",
      "@version" => "1",
    "@timestamp" => "2016-10-31T19:03:32.036Z",
          "path" => "/srv/logstash/logs/test.xml",
          "host" => "1afc6eb0026b",
          "type" => "test-xml",
      "hostname" => "crt-mon",
          "date" => "2016.11.01",
          "time" => "01:23:04 CET",
       "release" => "11.6",
       "version" => "2.1",
     "timestamp" => "2016.11.01 01:23:04 CET"
}
{
       "message" => "",
      "@version" => "1",
    "@timestamp" => "2016-10-31T19:03:33.624Z",
          "path" => "/srv/logstash/logs/test.xml",
          "host" => "1afc6eb0026b",
          "type" => "test-xml",
      "hostname" => "%{[hostname][0]}",
          "date" => "%{[date][0]}",
          "time" => "%{[time][0]}",
       "release" => "%{[release][0]}",
       "version" => "%{[version][0]}",
     "timestamp" => "%{[date][0]} %{[time][0]}"
}

This is obviously a problem because if I put a date/match filter later in the configuration to parse the timestamp field, I receive a dateparsefailure from Logstash.

To sum up, I'm opening this topic to ask for comments on these questions:

  • What is the proper way to let Logstash handle huge multiline XML files?
  • How can I remove the array when it contains only one element?

Regards
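
A note on the second event in the output above: its "message" is empty, which suggests that the file contains a blank line after the XML. On that event the xml filter creates no fields, and Logstash leaves a %{...} reference to a missing field as literal text, so mutate/replace writes the placeholder itself. A minimal sketch of one way to avoid this, assuming blank lines can simply be discarded and that "hostname" is present on every real event, is to drop empty messages and guard the mutate with a conditional:

```
filter {
  # Assumption: blank lines carry no data, so discard them early.
  if [message] =~ /^\s*$/ {
    drop { }
  }
  xml {
    store_xml => "false"
    source => "message"
    xpath => [
      "/properties/hostname/text()", "hostname",
      "/properties/date/text()", "date"
    ]
  }
  # Only flatten the arrays when the xml filter actually produced them.
  if [hostname] {
    mutate {
      replace => {
        "hostname" => "%{[hostname][0]}"
        "date" => "%{[date][0]}"
      }
    }
  }
}
```

Only two of the five fields are shown to keep the sketch short; the others follow the same pattern.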

Hi, I made some progress by using a rename filter in my Logstash configuration:

mutate {
  rename => [
    "[hostname][0]", "hostname",
    "[date][0]", "date",
    "[time][0]", "time",
    "[release][0]", "release",
    "[version][0]", "version"
  ]
}

And pipeline output:

{:timestamp=>"2016-11-02T21:39:27.950000+0000", :message=>"Pipeline main started"}
{
       "message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?><properties><hostname>crt-mon</hostname><date>2016.11.01</date><time>01:23:04 CET</time><release>11.6</release><version>2.1</version></properties>",
      "@version" => "1",
    "@timestamp" => "2016-11-02T21:39:26.341Z",
          "path" => "/srv/logstash/logs/test.xml",
          "host" => "5d79431640db",
          "type" => "test-xml",
      "hostname" => "crt-mon",
          "date" => "2016.11.01",
          "time" => "01:23:04 CET",
       "release" => "11.6",
       "version" => "2.1"
}

My only remaining question is how to handle a multi-line XML file, because I'm still using a single-line one.

OK, I've run a lot of tests with the multiline filter, and it seems that Logstash always waits for the next multiline event to begin. So it appears impossible to get an entire XML file parsed when multiline is involved and no new lines arrive in the pipeline.
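
One option that may help here is the multiline codec's auto_flush_interval setting, which emits the buffered lines as an event once no new line has arrived for the given number of seconds. The following is only a sketch, assuming the installed version of the codec supports that setting. The 2-second interval is an arbitrary choice, and the pattern is changed to match the XML declaration so that each new document starts a new event:

```
input {
  file {
    path => "/srv/logstash/logs/test_multiline.xml"
    type => "test-xml"
    start_position => "beginning"
    codec => multiline {
      # Any line that does NOT start with "<?xml" is appended to the previous one.
      pattern => "^<\?xml"
      negate => true
      what => "previous"
      # Flush the pending event after 2 seconds without new lines (arbitrary value).
      auto_flush_interval => 2
    }
  }
}
```

For a really large document, the codec's max_lines and max_bytes limits may also need to be raised, otherwise the event could be cut before the closing tag is reached.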