Parsing complicated XML - extracting nested fields and splitting


(Lsokal) #1

Hello,

I have the following multiline XML:

<SharedProject id="1">
 <m__modelerProxy id="1">
		<m__building id="140">
     <m__name>z2</m__name>
 <m__common class="SharedBuilding" id="141">
        <m__name>z2</m__name>
        <m__modelerProxy class="be.lucid.PEB.model.geometry.Zone" reference="140" />
        <m__energeticProxy class="Building" id="142">
           <attributes id="143">
           </attributes>
           <labels id="144" />
           <name>z2</name>
           <pebIdentifer>2</pebIdentifer>
           <pebUniqueIdentifer>2</pebUniqueIdentifer>
           <proxy class="SharedBuilding" reference="141" />
           <plot id="145">
           <protectedVolumes id="627">
		<ProtectedVolume id="628">
			<envelope id="631">
				<paroiRoles id="633">
					<ParoiRole id="634">
                                              <attributes id="635" />
                                              <labels id="636" />
                                              <name>236</name>
                                              <pebIdentifer>237</pebIdentifer>
                                              <position>SIDE</position>
                                              <paroi id="637">
					</ParoiRole>
				        <ParoiRole id="7395">
                                              <attributes id="7396" />
                                              <labels id="7397" />
                                              <name>237</name>
                                              <pebIdentifer>238</pebIdentifer>
                                              <position>SIDE</position>
                                             <paroi id="797">
				        </ParoiRole>
                                        ...
                                </paroiRoles>

I want to store only what's inside "ProtectedVolume" field and then split based on the "ParoiRole" field (I want separate events for this field).

I tried running the following input and filter configurations:

input {
  file {
    path => "pathtoxml/file.xml"
    start_position => "beginning"
    sincedb_path => "NUL"
    codec => multiline { pattern => "^Spalanzani" negate => true what => "previous" auto_flush_interval => 2 max_lines => 30000 }
  }
}

and

filter {
  xml { source => "message" target => "theXML" store_xml => true force_array => false }

  split { field => "[theXML][m__modelerProxy][m__building][m__common][m__energeticProxy][protectedVolumes][ProtectedVolume][envelope][paroiRoles][ParoiRole]" remove_field => "message" }

  mutate {
    remove_field => [ "message" ]
  }
}

However, I do not know how to remove all the fields above <ProtectedVolume> so that I get the results I want.

Thanks in advance for your help,


#2

Many of your XML elements are not terminated. I am guessing most of them should be of the <element attribute="a" /> form. If not, then the path inside theXML should include all the unclosed elements.

    xml { source => "message" target => "[theXML]" store_xml => true force_array => false }
    mutate { rename => { "[theXML][protectedVolumes][ProtectedVolume]" => "ProtectedVolume" } }
    mutate { remove_field => [ "message", "theXML" ] }
    split { field => "[ProtectedVolume][paroiRoles][ParoiRole]" }
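
To find out which path under theXML actually holds ProtectedVolume in the real file, it can help to print the parsed event before adding the rename and split. A minimal sketch, assuming the xml filter is run on its own first and the output is read on the console:

```
filter {
    # Parse the whole document into [theXML]; nothing is renamed or removed yet.
    xml { source => "message" target => "[theXML]" store_xml => true force_array => false }
}

output {
    # Print each event in full so the nested field names can be read off.
    stdout { codec => rubydebug }
}
```

The nesting shown under theXML in that output is the path to use in the rename and split above.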

(Lsokal) #3

Thank you very much, this is indeed what I was looking for.

For information, every tag has a closing tag in my XML file, but I am sharing a simplified version because the real one has 30k lines and is messier.

I have noticed an issue with my XML, and it's more complicated than I thought.
There is a nested pattern that I would like to extract, but I don't know how.

Thanks to you, I have successfully split my XML with:
split { field => "[ProtectedVolume][envelope][paroiRoles][ParoiRole]" }

However, this tag pattern is sometimes nested further; see the example below.

Inside <ParoiRole>, we have a <volume> tag, then another <paroiRoles><ParoiRole> pattern, which is repeated. In this case, I would like to store each nested <ParoiRole> in a separate event again.

<SharedProject id="1">
 <m__modelerProxy id="1">
		<m__building id="140">
     <m__name>z2</m__name>
 <m__common class="SharedBuilding" id="141">
        <m__name>z2</m__name>
        <m__modelerProxy class="be.lucid.PEB.model.geometry.Zone" reference="140" />
        <m__energeticProxy class="Building" id="142">
           <attributes id="143">
           </attributes>
           <labels id="144" />
           <name>z2</name>
           <pebIdentifer>2</pebIdentifer>
           <pebUniqueIdentifer>2</pebUniqueIdentifer>
           <proxy class="SharedBuilding" reference="141" />
           <plot id="145">
           <protectedVolumes id="627">
		<ProtectedVolume id="628">
			<envelope id="631">
				<paroiRoles id="633">
					<ParoiRole id="634">
                                              <attributes id="635" />
                                              <labels id="636" />
                                              <name>236</name>
                                              <pebIdentifer>237</pebIdentifer>
                                              <position>SIDE</position>
                                              <paroi id="637">
					</ParoiRole>
				        <ParoiRole id="7395">
                                              <attributes id="7396" />
                                              <labels id="7397" />
                                              <name>237</name>
                                              <pebIdentifer>238</pebIdentifer>
                                              <position>SIDE</position>
                                             <paroi id="797">
				        </ParoiRole>
                                        <ParoiRole>
                                           <attributes id="7398" />
                                           <labels id="7399" />
                                           <volume>
                                             <paroiRoles>
                                                <ParoiRole>
                                                   ...
                                                  
                                        ...
                                </paroiRoles>

In my case, I actually have 17 ParoiRole elements nested in the first one, and they end up in a single document when indexed. I am not sure which approach is best; the end goal is to generate statistics on fields contained in ParoiRole. The queries are messed up right now because of this missed nested pattern.

Is my request achievable with Logstash? Or should I ask for cleaner data, or transform it myself with another tool?

I hope my post was clear enough. I must admit I am a newbie working with a confusing XML.

Thanks and regards,


#4

You can do a second split if the nested field exists.

    split { field => "[ProtectedVolume][envelope][paroiRoles][ParoiRole]" }
    if [ProtectedVolume][envelope][paroiRoles][ParoiRole][volume][paroiRoles] {
        split { field => "[ProtectedVolume][envelope][paroiRoles][ParoiRole][volume][paroiRoles]" }
    }
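
Since the goal is statistics on the ParoiRole fields, it may also help to put outer and nested elements at the same field path after splitting. The following is only a sketch: it assumes the nested list sits under [volume][paroiRoles][ParoiRole] (by analogy with the outer path), and the top-level field name ParoiRole is a hypothetical choice.

```
split { field => "[ProtectedVolume][envelope][paroiRoles][ParoiRole]" }
if [ProtectedVolume][envelope][paroiRoles][ParoiRole][volume][paroiRoles][ParoiRole] {
    # Assumption: the nested list is one level below [volume][paroiRoles].
    split { field => "[ProtectedVolume][envelope][paroiRoles][ParoiRole][volume][paroiRoles][ParoiRole]" }
    # Move the nested element to the top-level field [ParoiRole] (hypothetical name).
    mutate { rename => { "[ProtectedVolume][envelope][paroiRoles][ParoiRole][volume][paroiRoles][ParoiRole]" => "[ParoiRole]" } }
} else {
    # No nested list: move the outer element to the same top-level field.
    mutate { rename => { "[ProtectedVolume][envelope][paroiRoles][ParoiRole]" => "[ParoiRole]" } }
}
```

With both cases stored under the same field, queries on, for example, [ParoiRole][position] would not need to know whether an element was nested.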

(Lsokal) #5

Thanks, this gives me exactly what I needed.