XML filtering subfields

Hey!
I'm a complete beginner to everything Elasticsearch/Logstash/Kibana related, so please have mercy. I have an XML file with many event logs that is (simplified) structured like this:

   <xmldata>
        <Event>
             <Computer>...</Computer>
             ...
             <EventData>
                <Data Name="ErrorCode"> 0 </Data>
                <Data Name="PrincipalSamName"> .support </Data>
                <Data Name="Status"> 0xc00484b2 </Data>
                ...
             </EventData>
        </Event>
        <Event>
             ...
        </Event>
   </xmldata>

My problem is that I don't know how to properly access all the subfields in the EventData block.

So far my config file looks like this:

input {
  file {
    path => "/media/sf_Shared_Folder/test/logstash/evtx_middle.xml"
    start_position => beginning
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^<\?xmldata .*>"
      negate => "true"
      what => "previous"
      max_lines => 2000
    }
  }
}

filter {
  xml {
    source => "message"
    target => "theXML"
    store_xml => "false"
    remove_namespaces => true
    xpath => ["/xmldata/Event", "Event"]
  }

  mutate {
    remove_field => ["message"]
  }

  split {
    field => "[Event]"
  }

  xml {
    source => "Event"
    store_xml => "false"
    remove_namespaces => true
    xpath => [
      "/Event/System/Computer/text()", "Computer",
      "/Event/EventData/Data/@Name", "Data Name",
      "/Event/EventData/Data/text()", "Data"
    ]
  }

  mutate {
    update => {
      "Computer" => "%{[Computer][0]}"
    }
  }

  mutate {
    add_field => { "%{[Data Name][0]}" => "%{[Data][0]}" }
  }

  date {
    match => ["SystemTime", "YYYY-MM-dd HH:mm:ss.SSSSSS"]
    timezone => "Europe/Berlin"
  }

  mutate {
    remove_field => ["Data Name", "Data"]
  }
}

output {
  stdout { codec => rubydebug }
}

But like that I only get the first subfield, so just "ErrorCode" => "0".
But I need all of the fields.

I hope you're able to understand my problem, and I'm sure (or at least I hope so) there's an easy solution to it, but I feel like I've tried everything and it won't work properly.

Really hope somebody can help me, because I've been trying to figure this out for days now.
Thanks in advance!

I don't think it makes sense to use two xml filters on the same XML. I would save the parsed XML and work on that.

filter {
    xml {
        source => "message"
        # parse into a field under [@metadata] so the raw parsed tree is not sent to outputs
        target => "[@metadata][theXML]"
        store_xml => true
        remove_namespaces => true
        force_array => false
        remove_field => ["message"]
    }

    split {
        field => "[@metadata][theXML][Event]"
    }

    ruby {
        code => '
            e = event.get("[@metadata][theXML][Event][EventData][Data]")
            if e
                # turn each <Data Name="...">value</Data> entry into a top-level field
                e.each { |x|
                    event.set(x["Name"], x["content"])
                }
            end
        '
    }
    mutate { copy => { "[@metadata][theXML][Event][Computer]" => "Computer" } }
}

You can use a rubydebug output to inspect the parsed XML.

output { stdout { codec => rubydebug { metadata => true } } }

Thanks for the feedback and the code.

I will give it a shot.

The challenge is that this is a piece of manufacturing equipment, and each recipe has the same format, but depending on the items it is inspecting you will see a different count. What I posted was only 3 inspections on a printed circuit board, but the actual file has over 10,000 inspections, so writing the configuration file for that by hand would be a nightmare. Hopefully this code will work through it.

First of all, sorry that it took me so long to reply; I had no access to the files over the long weekend. Thanks for your input! So I guess there's not just a simple line to add to my filter as it was? Because everything else was working fine, so I was hoping it wasn't all "useless".
But I will try your suggestion now. Can I ask where the @metadata comes from though? Just trying to actually understand everything.

Thanks again!

Unfortunately that doesn't work for me. If I put it in like that, I get an _xmlparsefailure and a _split_type_failure as well. I also tried adding the missing [xmldata] before all of the [Event] references, but it's still no better. It all comes out as one message, not several Events. If I do the split the way I did in my config file, that at least works, but then everything is in one block called "Event". Unfortunately I know nothing about Ruby, so I have no idea what I could change.

If you are getting an _xmlparsefailure tag then the [message] field is not valid XML. We cannot help you with that unless you can show us the actual [message] field.

Alright, I changed everything back so that it is exactly like your suggestion, and this is what I get for the message field:

 "message" => "<xmldata>\r\n<Event xmlns=\"http://schemas.microsoft.com/win/2004/08/events/event\"><System><Provider Name=\"Microsoft-Windows-AAD\" Guid=\"{4de9bc9c-b27a-43c9-8994-0915f1a5e24f}\"></Provider>\r\n<EventID Qualifiers=\"\">1089</EventID>\r\n<Version>0</Version>\r\n<Level>2</Level>\r\n<Task>101</Task>\r\n<Opcode>0</Opcode>\r\n<Keywords>0x4000000000000012</Keywords>\r\n<TimeCreated SystemTime=\"2019-02-01 15:08:46.508312\"></TimeCreated>\r\n<EventRecordID>19</EventRecordID>\r\n<Correlation ActivityID=\"{0114a2a5-ba40-0001-b6a2-140140bad401}\" RelatedActivityID=\"\"></Correlation>\r\n<Execution ProcessID=\"692\" ThreadID=\"696\"></Execution>\r\n<Channel>Microsoft-Windows-AAD/Operational</Channel>\r\n<Computer>xxx</Computer>\r\n<Security UserID=\"S-1-5-18\"></Security>\r\n</System>\r\n<EventData><Data Name=\"Status\">0xc00484b2</Data>\r\n</EventData>\r\n</Event>\r\n<Event xmlns=\"http://schemas.microsoft.com/win/2004/08/events/event\"><System><Provider Name=\"Microsoft-Windows-AAD\" Guid=\"{4de9bc9c-b27a-43c9-8994-0915f1a5e24f}\"></Provider>\r\n<EventID Qualifiers=\"\">1104</EventID>\r\n<Version>0</Version>\r\n<Level>2</Level>\r\n<Task>101</Task>\r\n<Opcode>0</Opcode>\r\n<Keywords>0x4000000000000012</Keywords>\r\n<TimeCreated SystemTime=\"2019-02-01 15:08:46.508322\"></TimeCreated>\r\n<EventRecordID>20</EventRecordID>\r\n<Correlation ActivityID=\"{0114a2a5-ba40-0001-b6a2-140140bad401}\" RelatedActivityID=\"\"></Correlation>\r\n<Execution ProcessID=\"692\" ThreadID=\"696\"></Execution>\r\n<Channel>Microsoft-Windows-AAD/Operational</Channel>\r\n<Computer>xxx</Computer>\r\n<Security UserID=\"S-1-5-18\"></Security>\r\n</System>\r\n<EventData><Data Name=\"API\">Plugin initialize</Data>\r\n<Data Name=\"Result\">3221521586</Data>\r\n</EventData>\r\n</Event>\r\n<Event xmlns=\"http://schemas.microsoft.com/win/2004/08/events/event\"><System><Provider Name=\"Microsoft-Windows-AAD\" Guid=\"{4de9bc9c-b27a-43c9-8994-0915f1a5e24f}\"></Provider>\r\n<EventID Qualifiers=\"\">1089</EventID>\r\n<Version>0</Version>\r\n<Level>2</Level>\r\n<Task>101</Task>\r\n<Opcode>0</Opcode>\r\n<Keywords>0x4000000000000012</Keywords>\r\n<TimeCreated SystemTime=\"2019-02-05 13:29:08.712667\"></TimeCreated>\r\n<EventRecordID>21</EventRecordID>\r\n<Correlation ActivityID=\"{bf1d25ff-bd56-0005-0026-1dbf56bdd401}\" RelatedActivityID=\"\"></Correlation>\r\n<Execution ProcessID=\"688\" ThreadID=\"692\"></Execution>\r\n<Channel>Microsoft-Windows-AAD/Operational</Channel>\r\n<Computer>xxx</Computer>\r\n<Security UserID=\"S-1-5-18\"></Security>\r\n</System>\r\n<EventData><Data Name=\"Status\">0xc00484b2</Data>\r\n</EventData>\r\n</Event>\r\n<Event xmlns=\"http://schemas.microsoft.com/win/2004/08/events/event\"><System><Provider Name=\"Microsoft-Windows-AAD\" Guid=\"{4de9bc9c-b27a-43c9-8994-0915f1a5e24f}\"></Provider>\r\n<EventID Qualifiers=\"\">1104</EventID>\r\n<Version>0</Version>\r\n<Level>2</Level>\r\n<Task>101</Task>\r\n<Opcode>0</Opcode>\r\n<Keywords>0x4000000000000012</Keywords>\r\n<TimeCreated SystemTime=\"2019-02-05 13:29:08.712677\"></TimeCreated>\r\n<EventRecordID>22</EventRecordID>\r\n<Correlation ActivityID=\"{bf1d25ff-bd56-0005-0026-1dbf56bdd401}\" RelatedActivityID=\"\"></Correlation>\r\n<Execution ProcessID=\"688\" ThreadID=\"692\"></Execution>\r\n<Channel>Microsoft-Windows-AAD/Operational</Channel>\r\n<Computer>xxx</Computer>\r\n<Security UserID=\"S-1-5-18\"></Security>\r\n</System>\r\n<EventData><Data Name=\"API\">Plugin initialize</Data>\r\n<Data 
Name=\"Result\">3221521586</Data>\r\n</EventData>\r\n</Event>\r",
          "tags" => [
        [0] "multiline",
        [1] "_xmlparsefailure",
        [2] "_split_type_failure"

exception=>#<REXML::ParseException: No close tag for /xmldata

Your multiline pattern is not capturing the close tag. If you want to consume the entire file you could use read mode for the file input. Alternatively, use a pattern that will never match and emit the event based on a timeout:

codec => multiline { pattern => "^Spalanzani" what => "previous" negate => true auto_flush_interval => 1 }
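
If you go the read-mode route instead, the input could look something like this. This is just a sketch based on the input you posted earlier, not a drop-in config; I have kept a multiline codec with a never-matching pattern so the whole file still ends up in a single event, and you may need a larger max_lines for big files.

input {
  file {
    path => "/media/sf_Shared_Folder/test/logstash/evtx_middle.xml"
    mode => "read"                # read the file from start to finish instead of tailing it
    sincedb_path => "/dev/null"
    codec => multiline {
      pattern => "^Spalanzani"    # never matches, so every line is appended to the previous one
      negate => true
      what => "previous"
      auto_flush_interval => 1
      max_lines => 20000          # raise this if the file has more lines than that
    }
  }
}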

What do you mean by read mode? Like I said, I'm completely new to Logstash, so sorry for the stupid questions, but Google couldn't help me either.
Also, why did it work with my original config file if I didn't change anything in the input section?

A file input can operate in either tail mode or read mode. The documentation explains what each one does.

I cannot explain why the behaviour would have changed. I just know that your message is missing the close tag.

Alright, thank you, that seems to work in some ways now.
Though now I get a ruby_exception: "no implicit conversion of String into Integer" for the Events that have a field with the name "Status" containing something like "0xc00484b2"; it works fine for the other Events without that field. Any tips on how to fix that?
Also, my file now gets deleted as soon as I run Logstash. Does that have something to do with the ruby exception, or is there another problem that could cause it?

I figured out that the file gets deleted because of read mode; by default it deletes the file once it is done with it.
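
In case anyone else runs into this: the file input has a file_completed_action option that controls what happens to a finished file. Something like the sketch below should keep the file instead of deleting it (the log path is just an example, pick your own):

file {
  path => "/media/sf_Shared_Folder/test/logstash/evtx_middle.xml"
  mode => "read"
  file_completed_action => "log"                     # default is "delete"
  file_completed_log_path => "/tmp/read_files.log"   # example path for recording finished files
  sincedb_path => "/dev/null"
}
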
So I'm only left with the ruby_exception.

Which filter generates that exception?

What do you mean by which filter?

[ERROR] 2019-04-24 13:41:34.697 [[main]>worker0] ruby - Ruby exception occurred: no implicit conversion of String into Integer
[ERROR] 2019-04-24 13:41:34.716 [[main]>worker0] ruby - Ruby exception occurred: no implicit conversion of String into Integer

That's what it says. Or do you need anything else?
Also, I tried out a few things, and I think the reason is not the "Status" field itself, but the fact that it is the only one in the EventData block for those specific events, if that makes any sense.
Like, when there's more than one field, as here:

<EventData>
   <Data Name="API">Plugin initialize</Data>
   <Data Name="Result">3221521586</Data>
</EventData>

it works fine, but when there's just one:

<EventData>
  <Data Name="Status">0xc00484b2</Data>
</EventData>

it throws the ruby_exception. I tried adding another field to those Events and then it worked, so I guess that's the problem. What can I do about it?

Update the ruby filter to handle the case where [Event][EventData][Data] is not an array. With force_array => false, a single Data element comes through as a hash rather than a one-element array; calling .each on a hash hands the block [key, value] arrays, and indexing an array with "Name" is what raises the "no implicit conversion of String into Integer" error.

ruby {
    code => '
        e = event.get("[@metadata][theXML][Event][EventData][Data]")
        if e.kind_of?(Array)
            # several Data entries
            e.each { |x|
                event.set(x["Name"], x["content"])
            }
        else
            # a single Data entry comes through as a hash, not an array
            event.set(e["Name"], e["content"])
        end
    '
}

Great, everything's working now, thank you!

Hey, sorry to bring the topic up again, but another problem came up and I can't seem to find a solution for it myself, since I assume it requires Ruby knowledge again, so I hope you can help me once more.
I get a _rubyexception for all the Events that look like this:

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event"><System><Provider Name="Qlik Sense Service Dispatcher"></Provider>
<EventID Qualifiers="0">300</EventID>
<Level>4</Level>
<Task>0</Task>
<Keywords>0x0080000000000000</Keywords>
<TimeCreated SystemTime="2019-01-03 08:36:19.361464"></TimeCreated>
<EventRecordID>139827</EventRecordID>
<Channel>Application</Channel>
<Computer>...</Computer>
<Security UserID=""></Security>
</System>
<EventData><Data>&lt;string&gt;Child process (2624) started
Facility = Next-generation Broker Service
ExePath = Node\node.exe
Script = ..\BrokerService\index.js
Command Line"Node\node.exe" "..\BrokerService\index.js" --port=4900 --log-path="..."&lt;/string&gt;
</Data>
<Binary></Binary>
</EventData>
</Event>

So the problem is that the EventData block doesn't follow the usual <Data Name="...">...</Data> structure; instead there's just one big block of text.
Any tips on how to include that in the ruby filter?

By the way, the exception I get is: no implicit conversion of nil into String.

When I tried it on a few selected Events, they at least still showed up in Kibana, just with the tag _rubyexception.

But now that I've tried it with around 60,000 Events, they won't even go through to Elasticsearch; there I get the following warning:

[WARN ] 2019-05-15 14:22:01.385 [[main]>worker1] elasticsearch - Could not index event to Elasticsearch. {:status=>400, :action=>["index", {:_id=>"128712", :_index=>"evtx44", :_type=>"xmlfiles", :routing=>nil}, #<LogStash::Event:0x4d92f2cc>], :response=>{"index"=>{"_index"=>"evtx44", "_type"=>"xmlfiles", "_id"=>"128712", "status"=>400, "error"=>{"type"=>"mapper_parsing_exception", "reason"=>"object mapping for [theXML.Event.EventData.Data] tried to parse field [Data] as object, but found a concrete value"}}}}

You can solve the first problem by verifying that Data is a hash before fetching its Name and content elements.

        code => '
            FieldName = "[theXML][Event][EventData][Data]"
            e = event.get(FieldName)
            if e.kind_of?(Array)
                e.each { |x|
                    event.set(x["Name"], x["content"])
                }
            elsif e.kind_of?(Hash)
                event.set(e["Name"], e["content"])
            end
        '

For the second... at first you had events where [theXML][Event][EventData][Data] was either an array

    "theXML" => {
    "Event" => {
        "EventData" => {
            "Data" => [
                [0] {
                    "content" => "Plugin initialize",
                       "Name" => "API"
                },
                [1] {
                    "content" => "3221521586",
                       "Name" => "Result"
                }
            ]
        }
    }

or a hash

    "theXML" => {
    "Event" => {
        "EventData" => {
            "Data" => {
                "content" => "0xc00484b2",
                   "Name" => "Status"
            }
        }
    }
},

Elasticsearch would accept either an array or a hash, because both are structured objects. But now you also have events where the Data field is a string:

    "theXML" => {
    "Event" => {
        "EventData" => {
            "Data" => "<string>Child process (2624) started Facility = Next-generation Broker Service ExePath = Node\\node.exe Script = ..\\BrokerService\\index.js Command Line\"Node\\node.exe\" \"..\\BrokerService\\index.js\" --port=4900 --log-path=\"...\"</string> "
        }
    }
},

When it gets a string, Elasticsearch maps the field as "text", not "object", and you cannot have both types in the same field. To resolve this, if the ruby code has already extracted all the interesting data from [theXML][Event][EventData][Data] when it is an Array or a Hash, then use

event.remove(FieldName)

in both branches of the if. Otherwise, add a branch to the if that turns the text into an object:

            else
                event.set(FieldName, { "Data" => e })
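
Putting those pieces together, the whole ruby filter could end up looking something like this. This is just a sketch, not a definitive version: it removes the field in the Array/Hash branches and wraps the bare-string case in an object, and either of those measures on its own is already enough to keep the mapping consistent. Adjust the FieldName path to wherever your xml filter actually puts the parsed event.

ruby {
    code => '
        FieldName = "[theXML][Event][EventData][Data]"
        e = event.get(FieldName)
        if e.kind_of?(Array)
            # several <Data Name="...">...</Data> entries
            e.each { |x|
                event.set(x["Name"], x["content"])
            }
            event.remove(FieldName)
        elsif e.kind_of?(Hash)
            # a single <Data Name="...">...</Data> entry
            event.set(e["Name"], e["content"])
            event.remove(FieldName)
        elsif e
            # a bare string, like the events that just contain one big text block;
            # wrap it in a hash so elasticsearch always sees an object here
            event.set(FieldName, { "Data" => e })
        end
    '
}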

Thanks for your help again! It works a lot better now; I get a lot more events through to Elasticsearch, but still not all of them for some reason. I don't get any more errors; the only thing that shows up is:
(ruby filter code):3: warning: already initialized constant FieldName
all over the screen before the actual data appears. Could that have something to do with it?