Parsing XML log file with logstash

Hello,

I am completely new to Elasticsearch. I successfully set it up and explored some test data, but fair warning: I am a newbie.

I am trying to index an XML log file through Logstash, and I am lost as to how to proceed. For information, I am using Windows.

Here is the structure of the XML file I am trying to parse.

<robot>
    <suite>
        <test>
            <kw>
                <doc>Some text I want to index</doc>
                <arguments>
                    <arg>Some other text I want to index</arg>

My entire XML is thus contained in the <robot> tag.

Here is my configuration file:

input {
    file {
        path => "pathtomyxml/file.xml"
        start_position => "beginning"
        sincedb_path => "NUL"
        type => "xml"
    }
}

filter {
    xml {
        source => "message"
        store_xml => false
        xpath => [
            "//robot/suite/test/kw/doc/text()", "doc_field",
            "//robot/suite/test/kw/doc/arguments/arg/text()", "arg_field"
        ]
    }
}

output {
    stdout {
        codec => dots
    }

    elasticsearch {
        index => "myxml-logs"
    }
}

My goal is to store the text contained in the <doc> and <arg> elements in two separate fields.
This is what I tried based on my understanding; however, I have several questions:

  • I do not really understand the source => "message" setting that I put in. It seems pretty standard, but what does it mean? The Elastic documentation was not clear enough for me.

  • Could anyone point out what I am doing wrong? When I run this, the result looks like complete nonsense when I check it in Kibana.

Thank you in advance!

doc is not inside arguments in the XML you show, so that xml filter should be

    xml {
        source => "message"
        store_xml => false
        xpath => [
            "//robot/suite/test/kw/doc/text()", "doc_field",
            "//robot/suite/test/kw/arguments/arg/text()", "arg_field"
        ]
    }

which will give you

 "doc_field" => [
    [0] "Some text I want to index"
],
 "arg_field" => [
    [0] "Some other text I want to index"
]

if the XML is a single event. By default a file input reads each line of the file as a separate event and runs it through the pipeline. No single line of the file is valid XML on its own, so none of it gets parsed. You need to use a multiline codec to combine all the lines of the file into a single event.

This codec takes every line that does not match ^Spalanzani (i.e., it takes every line) and combines them into one event. The auto_flush_interval is required because otherwise the codec would wait forever for a line that does match ^Spalanzani.

input {
    file {
        path => "/home/user/foo.xml"
        sincedb_path => "/dev/null"
        start_position => "beginning"
        codec => multiline { pattern => "^Spalanzani" negate => true what => "previous" auto_flush_interval => 2 }
    }
}

This is using the file input in "tail" mode. That input also has a "read" mode which provides another way of doing this.
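For example, something like this might work in read mode (the paths are placeholders, and it assumes a file input plugin recent enough to support mode => "read"):

    input {
        file {
            path => "/home/user/foo.xml"
            mode => "read"
            sincedb_path => "/dev/null"
            # in read mode the default file_completed_action is "delete",
            # so tell it to log the completed file instead of deleting it
            file_completed_action => "log"
            file_completed_log_path => "/home/user/completed.log"
            # the lines still need to be joined into a single event
            codec => multiline { pattern => "^Spalanzani" negate => true what => "previous" auto_flush_interval => 2 }
        }
    }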


Hi Badger,

First of all, thank you for your detailed answer.
However, I am still having trouble parsing it correctly.

One parameter that raises a question is source => "message". What does it do?
Shouldn't it be source => "robot" in my case?

Anyway, trying your input and filter configuration:

input {
    file {
        path => "pathtoxml/file.xml"
        start_position => "beginning"
        sincedb_path => "NUL"
        codec => multiline { pattern => "^Spalanzani" negate => true what => "previous" auto_flush_interval => 2 max_lines => 3000 }
    }
}

filter {
    xml {
        source => "message"
        store_xml => false
        xpath => [
            "//robot/suite/test/kw/doc/text()", "doc_field",
            "//robot/suite/test/kw/arguments/arg/text()", "arg_field"
        ]
    }
}

output {
    stdout {
        codec => dots
    }

    elasticsearch {
        index => "myindex"
    }
}

It gives me the following:

=> One single hit, with the "message" field containing the entire XML, plus "doc_field" and "arg_field", which are arrays containing every occurrence.

i.e.:

 "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
          {
            "_index" : "myindex",
            "_type" : "doc",
            "_id" : "xxxx",
            "_score" : 1.0,
            "_source" : {
              "@timestamp" : "2019-02-07T11:09:35.219Z",
              "tags" : [
                "multiline"
              ],
              "message" : """ ENTIRE XML !!,
    "doc_field" : [
                "text1",
                "text2",
                "text3",
                ...., ],
    "arg_field" : [
                "text1",
                "text2",
                "..." ]
 ]
    }

How can I:

  • not store the entire XML in the message field?

  • have separate documents for each "kw" tag (so that each "doc_field" or "arg_field" contains a single occurrence)?

I hope it was clear enough, thanks again for your help.

source has to be "message" because that is the name of the field on the event that contains the XML. There is no "robot" field on the event.
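To illustrate (a trimmed, rubydebug-style sketch rather than your exact output), the event that reaches the filter section looks roughly like this, which is why the xml filter has to read from message:

    {
           "message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<robot generated=\"20190204 ...\">\n    <suite ...",
        "@timestamp" => 2019-02-07T11:09:35.219Z,
          "@version" => "1",
              "tags" => [
            [0] "multiline"
        ],
              "path" => "pathtoxml/file.xml"
    }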

You are now saying there are multiple kw elements. It sounds like you want each one to be a separate event. However, you are forcing me to guess. Please show the format of the XML and describe what output you want.

If there are multiple kw elements and you want each one as an event then the following might work.

    xml {
        source => "message"
        target => "theXML"
        store_xml => true
    }
    split { field => "[theXML][suite][0][test][0][kw]" }

Hey,

Sorry I haven't been clear; I'll try to explain myself better. Yes, there are multiple kw elements, and I would like to store each kw as a separate event.

My XML file has 3000 lines, so I can't copy-paste it all here.
Here is a snapshot:

<?xml version="1.0" encoding="UTF-8"?>
<robot generated="20190204 14:20:19.932" generator="Robot 3.0.3.dev20170213 (Python 2.7.15 on win32)">
    <suite source="C:\BAT-Copy\bat-electron\out-tsc\main\main\resources\robotframework\acceptance\Test_Case_1.txt" id="s1" name="Test Case 1">
        <test id="s1-t1" name="Default">
            <kw name="Register Keyword To Run On Failure" library="SeleniumLibrary">
                <doc>Sets the keyword to execute when a SeleniumLibrary keyword fails.</doc>
                <arguments>
                    <arg>Nothing</arg>
                </arguments>
                <msg timestamp="20190204 14:20:40.248" level="INFO">No keyword will be run on failure.</msg>
                <status status="PASS" endtime="20190204 14:20:40.248" starttime="20190204 14:20:40.248"></status>
            </kw>

And I would like to store each kw as an event with the corresponding JSON format, so:

{
   "_index":"myindex",
   "kw":{
      "doc":"Sets the keyword to execute when a Selenium...",
      "arguments":{
         "arg":"Nothing"
      },
      "msg":"No Keyword will..",
      "status":""
   }
}

, etc.

Each <kw></kw> should be one document.

Thanks in advance,

OK, then you should try these two and see which one you prefer.

xml { source => "message" target => "theXML" store_xml => true }
split { field => "[theXML][suite][0][test][0][kw]" }

xml { source => "message" target => "theXML" store_xml => true force_array => false }
split { field => "[theXML][suite][test][kw]" }

The latter will give you events like this

{
   "message" => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<robot generated=\"20190204 14:20:19.932\" generator=\"Robot 3.0.3.dev20170213 (Python 2.7.15 on win32)\">\n    [...]",
      "tags" => [
    [0] "multiline"
],
  "@version" => "1",
    "theXML" => {
    "generated" => "20190204 14:20:19.932",
    "generator" => "Robot 3.0.3.dev20170213 (Python 2.7.15 on win32)",
        "suite" => {
        "source" => "C:\\BAT-Copy\\bat-electron\\out-tsc\\main\\main\\resources\\robotframework\\acceptance\\Test_Case_1.txt",
          "test" => {
              "kw" => {
                "arguments" => {
                    "arg" => "Some more text I want to index"
                },
                      "doc" => "Some other text I want to index"
            },
              "id" => "s1-t1",
            "name" => "Default"
        },
            "id" => "s1",
          "name" => "Test Case 1"
    }
}
}

Typically after using xml+split you will want to use mutate+rename to move fields around.
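For example, assuming the second variant above (the field names are taken from that output, and this is a sketch rather than something tested against your full file), the following would promote each kw to its own top-level field and also drop the raw XML from the event, which covers your earlier question about not keeping the whole file in message:

    filter {
        xml { source => "message" target => "theXML" store_xml => true force_array => false }
        split { field => "[theXML][suite][test][kw]" }

        # promote the split <kw> element to a top-level field
        mutate { rename => { "[theXML][suite][test][kw]" => "kw" } }
        # drop the raw XML and the rest of the parsed tree
        mutate { remove_field => [ "message", "theXML" ] }
    }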


Thanks!! The 2nd result is better.

Have a great day.
