Using logstash

I am trying to use the Logstash documentation to set up an XML filter, but I just can't seem to get it right.

My XML format is pretty straightforward:

@fereshteh,

XML parsing should work easily, even with multiline. Consider the following input (xml.log):

<html lang="en" class="sample">
  <head mycompany="http://url.me/profile#">
    <meta charset='utf-8'>
       <mytag mykey='myval'/>
    </meta>
  </head>
</html>

<html lang="en" class="sample">
  <head mycompany="http://url.me/profile#">
    <meta charset='utf-8'>
       <mytag mykey2='myval2'/>
    </meta>
  </head>
</html>

And this Logstash Config:

input
{
        file
        {
                path => "/sample-inputs/xml.log"
                sincedb_path => "/dev/null"
                start_position => "beginning"
        }
}

filter
{
        multiline
        {
                pattern => "^<html"
                negate => true
                what => previous
        }
        xml
        {
                source => [ "message" ]
                target => [ "x" ]
        }
}

output
{
        stdout
        {
                codec => rubydebug
        }
}

Produces:

{
       "message" => "<html lang=\"en\" class=\"sample\">\n  <head mycompany=\"http://url.me/profile#\">\n    <meta charset='utf-8'>\n       <mytag mykey='myval'/> \n    </meta>\n  </head>\n</html>\n",
      "@version" => "1",
    "@timestamp" => "2015-10-19T13:54:45.774Z",
          "tags" => [
        [0] "multiline"
    ],
             "x" => {
         "lang" => "en",
        "class" => "sample",
         "head" => [
            [0] {
                "mycompany" => "http://url.me/profile#",
                     "meta" => [
                    [0] {
                        "charset" => "utf-8",
                          "mytag" => [
                            [0] {
                                "mykey" => "myval"
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
{
       "message" => "<html lang=\"en\" class=\"sample\">\n  <head mycompany=\"http://url.me/profile#\">\n    <meta charset='utf-8'>\n       <mytag mykey2='myval2'/> \n    </meta>\n  </head>\n</html>",
      "@version" => "1",
    "@timestamp" => "2015-10-19T13:54:45.777Z",
          "tags" => [
        [0] "multiline"
    ],
             "x" => {
         "lang" => "en",
        "class" => "sample",
         "head" => [
            [0] {
                "mycompany" => "http://url.me/profile#",
                     "meta" => [
                    [0] {
                        "charset" => "utf-8",
                          "mytag" => [
                            [0] {
                                "mykey2" => "myval2"
                            }
                        ]
                    }
                ]
            }
        ]
    }
}
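If it helps to see why "x" comes out nested like that, here is a rough Python sketch of the shape of the conversion. It is not Logstash's actual code, and `to_struct` is a name I made up for illustration. Attributes become plain keys, and every child element becomes a list of nested hashes (which is where the `[0]` entries come from). Text content and namespaces are not modeled.

```python
import xml.etree.ElementTree as ET

def to_struct(elem):
    """Hypothetical helper: attributes become keys, children become lists."""
    node = dict(elem.attrib)
    for child in elem:
        # Each child tag maps to a list, since a tag can occur more than once.
        node.setdefault(child.tag, []).append(to_struct(child))
    return node

SAMPLE = """<html lang="en" class="sample">
  <head mycompany="http://url.me/profile#">
    <meta charset='utf-8'>
       <mytag mykey='myval'/>
    </meta>
  </head>
</html>"""

x = to_struct(ET.fromstring(SAMPLE))
print(x["lang"], x["class"])
print(x["head"][0]["meta"][0]["mytag"])
```

The nesting mirrors the rubydebug output above: `x["head"][0]["meta"][0]["mytag"][0]["mykey"]` is where `myval` ends up.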

@fereshteh,

You can do that too. If you want to parse out the attributes by XPath, do it like this:

xml
{
        store_xml => false
        source => [ "message" ]
        xpath => [
                "/html/@lang","lang",
                "/html/@class","class"
        ]
}

Yields:

{
  "lang" => [
        [0] "en"
    ],
  "class" => [
        [0] "sample"
    ]
}

Note that the attributes come out as arrays, because there can be multiple instances of the node specified by the path. If you like, you could do something like this to convert them to strings.

mutate
{
        join => {
                "lang" => ","
                "class" => ","
        }
}

Yields:

{
         "lang" => "en",
         "class" => "sample"
}
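To illustrate why the XPath results are lists and what the join does, here is a small Python sketch using only the standard library. It is an approximation, not what Logstash runs internally: ElementTree has no attribute step like `/html/@lang`, so the made-up helper `attr_values` takes the element path and the attribute name separately. The second document with two `mytag` nodes is a hypothetical input to show the multi-match case.

```python
import xml.etree.ElementTree as ET

def attr_values(xml_text, element_path, attr):
    """Hypothetical helper: attribute value of every element matching the path."""
    root = ET.fromstring(xml_text)
    return [el.get(attr) for el in root.findall(element_path)
            if el.get(attr) is not None]

# Hypothetical input with two <mytag> nodes under <head>.
doc = ('<html lang="en" class="sample"><head>'
       '<mytag mykey="a"/><mytag mykey="b"/></head></html>')

lang = attr_values(doc, ".", "lang")              # roughly /html/@lang
keys = attr_values(doc, "./head/mytag", "mykey")  # two nodes match this path

# Same idea as mutate's join: collapse the list into one string.
lang_joined = ",".join(lang)
keys_joined = ",".join(keys)
print(lang, lang_joined)
print(keys, keys_joined)
```

With a single match the join just unwraps the value; with several matches you get them comma-separated, so pick a separator that cannot appear in your data.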

HTH

The multiline filter requires that the html tag is at the beginning of the line - make sure your input file is exactly as mine is.
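Here is a rough Python sketch of the grouping rule that config describes (pattern "^<html", negate => true, what => previous), in case it helps with debugging. It is only a model of the logic, and `group_multiline` is a name I invented: any line that does not match the pattern is glued onto the previous event, so an indented `<html` never starts a new one.

```python
import re

def group_multiline(lines, pattern=r"^<html"):
    """Hypothetical model: non-matching lines are appended to the previous event."""
    start = re.compile(pattern)
    events = []
    for line in lines:
        if start.search(line) or not events:
            events.append(line)          # pattern matched: a new event begins
        else:
            events[-1] += "\n" + line    # no match: belongs to the previous event
    return events

good = ['<html lang="en">', '  <head/>', '</html>',
        '<html lang="de">', '  <head/>', '</html>']
# Same input, but the second <html> tag is indented by two spaces.
bad = ['<html lang="en">', '  <head/>', '</html>',
       '  <html lang="de">', '  <head/>', '</html>']

good_events = group_multiline(good)
bad_events = group_multiline(bad)
print(len(good_events), len(bad_events))
```

In the second case both documents end up in one event, and that merged text is no longer a single well-formed XML document.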

Let me know if that helps!

Jay

On Windows, use the stdin input instead and redirect xml.log into it:

input {
        stdin { }
}

Then run:

bin\logstash -f test.cfg < xml.log

I have not tested this, but it should work.

@fereshteh,

In your initial question, the XML was formatted on multiple lines, so we used the multiline filter. Is your expected input always on a single line? Or do you still expect to see events over multiple lines?

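You can use the following to split your single-line XML into one event per document. The line break inside the gsub replacement string is deliberate (it inserts a literal newline), and split with no options splits the message field on newlines: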

mutate {
        gsub => ["message", "</html><html", "</html>
<html"]
}

split {
}

xml {
        source => ["message"]
        target => ["x"]
}
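The same three steps can be modeled in a few lines of Python if you want to try them on a sample line first. This is only a sketch of the idea: the input line is made up, and plain `str.replace` stands in for gsub because the pattern here contains no regex metacharacters.

```python
import xml.etree.ElementTree as ET

# Hypothetical single-line input holding two documents back to back.
line = '<html lang="en"><head/></html><html lang="de"><head/></html>'

# gsub step: put a newline between adjacent documents.
separated = line.replace("</html><html", "</html>\n<html")

# split step: one event per line.
events = separated.split("\n")

# xml step: each event now parses as its own document.
langs = [ET.fromstring(event).get("lang") for event in events]
print(events)
print(langs)
```

Note that this only works when the closing and opening tags are directly adjacent; any whitespace between `</html>` and `<html` would need a looser pattern.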

@fereshteh,

In that case, you first need to extract the XML from the rest of the message using grok, like so:

filter
{
        grok {
                match => { "message" => ".*(?<the_xml><html.*</html>).*" }        
        }
        xml {
                store_xml => false
                source => [ "the_xml" ]
                xpath => [
                        "/html/@lang","lang",
                        "/html/@class","class"
                ]
        }
}
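If you want to check the pattern against one of your own lines before running Logstash, a quick Python sketch like the one below may help. The log line is hypothetical, since I don't know your exact format. Python writes named groups as `(?P<name>...)` where grok uses `(?<name>...)`; otherwise the expression is the same.

```python
import re

# Same expression as the grok pattern, with Python's named-group spelling.
PATTERN = re.compile(r".*(?P<the_xml><html.*</html>).*")

# Hypothetical log line with text before and after the XML.
line = ('2015-10-19 13:54:45 INFO payload='
        '<html lang="en" class="sample"></html> status=ok')

match = PATTERN.search(line)
the_xml = match.group("the_xml") if match else None
print(the_xml)

# A line without any <html>...</html> simply does not match.
no_match = PATTERN.search("2015-10-19 13:54:45 INFO nothing here")
print(no_match)
```

In this sketch `.` does not match newlines, so the XML has to sit on a single line for the capture to work. Also, because the leading `.*` is greedy, a line containing several documents would capture only the last one.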