XML Dynamic Parsing

Still new to ELK. I'd like to parse an XML with Logstash and output to Elasticsearch.

XML:

<mydata>
    <data1>O</data1>
    <data2>false</data2>
            .
            .
            .
    <data3 REPEATINGTYPE="PageGroup">
        <rowdata REPEATINGINDEX="subdata1">
            <datax1>mycontent1</datax1>
            <datax2>mycontent2</datax2>
            .
            .
            .
        </rowdata>
        <rowdata REPEATINGINDEX="subdata2">
            <datax1>mycontent1</datax1>
            <datax2>mycontent2</datax2>
            .
            .
            .
        </rowdata>
    .
    .
    .
</mydata>

My Config File:

input {
    file {
        path => "/usr/share/logstash/bin/myXML.xml"
        start_position => "beginning"
    }
}
#filter{
#    I DON'T KNOW WHAT TO PUT HERE
#}
output {
    elasticsearch {
        hosts => ["localhost:9200"]
        user => "elastic"
        password => "changeme"
    }
    stdout {}
}

I am confused as to how I should write my config file with the filters and so on. The XML file will have nearly hundreds of fields (hence the ". . ."), and some have sub-fields (sort of an object-oriented way of encapsulating data within other data, like a class in Java). Is there a way to dynamically parse the XML file so I don't have to manually define the fields and their contents?

Also, am I outputting to Elasticsearch correctly?

Any help would be greatly appreciated :slight_smile:

Use the ruby filter plugin with the Nokogiri Ruby gem.

@Suman_Reddy1 is there any documentation and examples of ruby and Nokogiri anywhere?

I found this: https://www.elastic.co/guide/en/logstash/current/plugins-filters-ruby.html

You should use a multiline codec on the input to consume the entire file as a single event. There are many threads about how to do that. Then you can use a Logstash xml filter to parse the XML:

filter {
  xml { source => "message" target => "theXML" }
}
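The multiline input could look something like this (a sketch, not a tested config: the `pattern` assumes only the first line of the file starts with `<mydata>`, and `auto_flush_interval` is one way to get the event emitted once the file stops growing):

```
input {
    file {
        path => "/usr/share/logstash/bin/myXML.xml"
        start_position => "beginning"
        sincedb_path => "/dev/null"
        codec => multiline {
            pattern => "^<mydata>"
            negate => true
            what => "previous"
            auto_flush_interval => 1
        }
    }
}
```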

The output looks like this

        "theXML" => {
        "data3" => [
            [0] {
                "REPEATINGTYPE" => "PageGroup",
                      "rowdata" => [
                    [0] {
                                "datax1" => [
                            [0] "mycontent1"
                        ],
                                "datax2" => [
                            [0] "mycontent2"
                        ],
                        "REPEATINGINDEX" => "subdata1"
                    },
                    [1] {
                                "datax1" => [
                            [0] "mycontent1"
                        ],
                                "datax2" => [
                            [0] "mycontent2"
                        ],
                        "REPEATINGINDEX" => "subdata2"
                    }
                ]
            }
        ],
        "data1" => [
            [0] "O"
        ],
        "data2" => [
            [0] "false"
        ]
    }

or, if you set force_array => false

        "theXML" => {
        "data3" => {
            "REPEATINGTYPE" => "PageGroup",
                  "rowdata" => [
                [0] {
                            "datax2" => "mycontent2",
                    "REPEATINGINDEX" => "subdata1",
                            "datax1" => "mycontent1"
                },
                [1] {
                            "datax2" => "mycontent2",
                    "REPEATINGINDEX" => "subdata2",
                            "datax1" => "mycontent1"
                }
            ]
        },
        "data1" => "O",
        "data2" => "false"
    }
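For reference, that second output would come from the same filter with the option added:

```
filter {
  xml { source => "message" target => "theXML" force_array => false }
}
```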

Ok. My bad, I didn't see your configuration.
With your configuration, Logstash will read each line as a new event. To fix that, use the multiline codec. The multiline codec aggregates multiple lines into a single log event; in this case it will combine the whole XML file into one event. There are plenty of samples around multiline; please go through them. As I am on a mobile device, I am unable to give you the exact config.
Secondly, once the lines are aggregated into a single XML event, use ruby code:

ruby {
    code => "
        require 'nokogiri'
        xml = event.get('message')
        # parse xml with Nokogiri here
    "
}

Above is a reference for how to parse XML.

@Badger does the Logstash XML filter work to dynamically parse out every tag in the XML? In other words, do I have to specify each field in the filter that exists in my XML?

@Suman_Reddy1 Thank you. I will take a look into Ruby and Nokogiri. If you have any more info once you have time, I'd really appreciate that too. I'll try and keep learning.

Yes, it is dynamic. The filter I showed parsed out all the fields in the XML without naming any of them.

Below is a recursive way of iterating all elements in an xml

ruby {
    code => "
        require 'nokogiri'

        def iterative(ele)
            ele.children.each do |tempNode|
                if tempNode.text?
                    puts tempNode.content
                else
                    iterative(tempNode)
                end
            end
        end

        xml_doc = Nokogiri::XML.parse(event.get('xml-data'))
        iterative(xml_doc)
    "
}

Above is the sample we used to parse XML and do some inline masking on the data. It should give you some insight into XML processing. If you don't have to do much manipulation on the XML, I would suggest Badger's solution rather than this.
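The same recursive traversal can be tried outside Logstash. Below is a minimal standalone sketch that collects the text of every element; it uses Ruby's stdlib REXML instead of Nokogiri (a substitution, so it runs without installing gems), and the sample XML is a shortened stand-in for the real file:

```ruby
require 'rexml/document'

# Recursively walk every node, collecting non-empty text content.
def iterative(ele, out = [])
  ele.children.each do |node|
    if node.is_a?(REXML::Text)
      text = node.value.strip
      out << text unless text.empty?
    elsif node.is_a?(REXML::Element)
      iterative(node, out)
    end
  end
  out
end

sample = '<mydata><data1>O</data1><data3><rowdata><datax1>mycontent1</datax1></rowdata></data3></mydata>'
doc = REXML::Document.new(sample)
puts iterative(doc.root).inspect  # => ["O", "mycontent1"]
```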


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.