Parsing XML log file with logstash

In the XML you show, doc is not inside arguments, so the xml filter should be

    xml {
        source => "message"
        store_xml => false
        xpath => [
            "//robot/suite/test/kw/doc/text()", "doc_field",
            "//robot/suite/test/kw/arguments/arg/text()", "arg_field"
        ]
    }

which will give you

    "doc_field" => [
        [0] "Some text I want to index"
    ],
    "arg_field" => [
        [0] "Some other text I want to index"
    ]

if the XML is a single event. By default, a file input reads each line of the file as a separate event and runs it through the pipeline. No single line of the file is valid XML, so none of it gets parsed. You need to use a multiline codec to combine all the lines of the file into a single event.

The codec configuration below takes every line that does not match ^Spalanzani (i.e., every line, assuming nothing in the file starts with that string) and appends it to the previous line, so the whole file becomes one event. The auto_flush_interval is required because otherwise the codec would wait forever for a line that does match ^Spalanzani before emitting the event.

    input {
        file {
            path => "/home/user/foo.xml"
            sincedb_path => "/dev/null"
            start_position => "beginning"
            codec => multiline { pattern => "^Spalanzani" negate => true what => "previous" auto_flush_interval => 2 }
        }
    }

This is using the file input in "tail" mode. That input also has a "read" mode which provides another way of doing this.
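
Putting the two pieces together, a complete tail-mode pipeline might look like the sketch below. The input and filter sections are the ones from this answer. The stdout output with the rubydebug codec is my assumption, added only so you can inspect the resulting event on the console; swap in whatever output you actually use.

    input {
        file {
            path => "/home/user/foo.xml"
            # Do not remember the read position between runs (handy while testing)
            sincedb_path => "/dev/null"
            start_position => "beginning"
            # Join every line onto the previous one; flush after 2 seconds of inactivity
            codec => multiline {
                pattern => "^Spalanzani"
                negate => true
                what => "previous"
                auto_flush_interval => 2
            }
        }
    }

    filter {
        xml {
            source => "message"
            # Only extract the xpath matches, do not store the whole parsed document
            store_xml => false
            xpath => [
                "//robot/suite/test/kw/doc/text()", "doc_field",
                "//robot/suite/test/kw/arguments/arg/text()", "arg_field"
            ]
        }
    }

    output {
        # Assumed output for debugging; prints each event in the format shown earlier
        stdout { codec => rubydebug }
    }

Note that the pattern only works as a "never matches" marker as long as no line in your XML starts with Spalanzani.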
