Turning one .xml into multiple events where not all values are always filled

Hello All,

I'm learning Logstash basics and trying to find a way to convert an .xml file into multiple events (separate JSON documents) that will be sent to Elasticsearch. The catch is that the content of individual XML elements varies: one event might have test info, another might not, and so on. More details below. Each .xml file consists of two lines: an XML declaration line and a content line. I'm only interested in the content line. Here's an example source file:

<?xml version="1.0" encoding="UTF-8"?>
<BATCH TIMESTAMP="2019-01-02T04:04:51.931+01:00" SOFTWARE="1.0"><PLANT NAME="PARIS" LINE="PRODLINE1"/><PRODUCT ID="3" NAME="MOTHERBOARD"><GROUP ID="1" NAME="TESTGROUP1"><TEST ID="10" NAME="VOLTAGETEST" VALUE="2.34523" STATUS="OK"/><TEST ID="20" NAME="INTEGRATION" VALUE="1.00000" STATUS="NOK"/></GROUP><GROUP ID="2" NAME="CHANGEOVER">DONE</GROUP><GROUP ID="3" NAME="NOTIFICATION"><EXTRA TEXT="OK"/></GROUP><GROUP ID="4" NAME="VOLTAGE_NOTIF"><TEST ID="10" NAME="VOLTAGETEST" VALUE="4.00001" STATUS="NOK"/><EXTRA TEXT="OK"/></GROUP></PRODUCT></BATCH>

Formatted for better readability:

<?xml version="1.0" encoding="UTF-8"?>
<BATCH TIMESTAMP="2019-01-02T04:04:51.931+01:00" SOFTWARE="1.0">
<PLANT NAME="PARIS" LINE="PRODLINE1"/>
<PRODUCT ID="3" NAME="MOTHERBOARD">
	<GROUP ID="1" NAME="TESTGROUP1">
		<TEST ID="10" NAME="VOLTAGETEST" VALUE="2.34523" STATUS="OK"/>
		<TEST ID="20" NAME="INTEGRATION" VALUE="1.00000" STATUS="NOK"/>
	</GROUP>
	<GROUP ID="2" NAME="CHANGEOVER">DONE</GROUP>
	<GROUP ID="3" NAME="NOTIFICATION">
		<EXTRA TEXT="OK"/>
	</GROUP>
	<GROUP ID="4" NAME="VOLTAGE_NOTIF">
		<TEST ID="10" NAME="VOLTAGETEST" VALUE="4.00001" STATUS="NOK"/>
		<EXTRA TEXT="OK"/>
	</GROUP>
</PRODUCT>
</BATCH>

And here are the desired results, 5 separate events:

{"BATCH_TIMESTAMP": "2019-01-02T04:04:51.931+01:00", "BATCH_SOFTWARE": "1.0", "PLANT_NAME": "PARIS", "PLANT_LINE": "PRODLINE1", "PRODUCT_ID": "3", "PRODUCT_NAME": "MOTHERBOARD", "GROUP_ID": "1", "GROUP_NAME": "TESTGROUP1", "GROUP_VALUE": "", "TEST_ID": "10", "TEST_NAME": "VOLTAGETEST", "TEST_VALUE": "2.34523", "TEST_STATUS": "OK", "EXTRA_TEXT": ""}
{"BATCH_TIMESTAMP": "2019-01-02T04:04:51.931+01:00", "BATCH_SOFTWARE": "1.0", "PLANT_NAME": "PARIS", "PLANT_LINE": "PRODLINE1", "PRODUCT_ID": "3", "PRODUCT_NAME": "MOTHERBOARD", "GROUP_ID": "1", "GROUP_NAME": "TESTGROUP1", "GROUP_VALUE": "", "TEST_ID": "20", "TEST_NAME": "INTEGRATION", "TEST_VALUE": "1.00000", "TEST_STATUS": "NOK", "EXTRA_TEXT": ""}
{"BATCH_TIMESTAMP": "2019-01-02T04:04:51.931+01:00", "BATCH_SOFTWARE": "1.0", "PLANT_NAME": "PARIS", "PLANT_LINE": "PRODLINE1", "PRODUCT_ID": "3", "PRODUCT_NAME": "MOTHERBOARD", "GROUP_ID": "2", "GROUP_NAME": "CHANGEOVER", "GROUP_VALUE": "DONE", "TEST_ID": "", "TEST_NAME": "", "TEST_VALUE": "", "TEST_STATUS": "", "EXTRA_TEXT": ""}
{"BATCH_TIMESTAMP": "2019-01-02T04:04:51.931+01:00", "BATCH_SOFTWARE": "1.0", "PLANT_NAME": "PARIS", "PLANT_LINE": "PRODLINE1", "PRODUCT_ID": "3", "PRODUCT_NAME": "MOTHERBOARD", "GROUP_ID": "3", "GROUP_NAME": "NOTIFICATION", "GROUP_VALUE": "", "TEST_ID": "", "TEST_NAME": "", "TEST_VALUE": "", "TEST_STATUS": "", "EXTRA_TEXT": "OK"}
{"BATCH_TIMESTAMP": "2019-01-02T04:04:51.931+01:00", "BATCH_SOFTWARE": "1.0", "PLANT_NAME": "PARIS", "PLANT_LINE": "PRODLINE1", "PRODUCT_ID": "3", "PRODUCT_NAME": "MOTHERBOARD", "GROUP_ID": "4", "GROUP_NAME": "VOLTAGE_NOTIF", "GROUP_VALUE": "", "TEST_ID": "10", "TEST_NAME": "VOLTAGETEST", "TEST_VALUE": "4.00001", "TEST_STATUS": "NOK", "EXTRA_TEXT": "OK"}

Formatted and explained below:

  • Event 1 (Group 1, Test 10, No Extra Text):

{
"BATCH_TIMESTAMP":"2019-01-02T04:04:51.931+01:00",
"BATCH_SOFTWARE":"1.0",
"PLANT_NAME":"PARIS",
"PLANT_LINE":"PRODLINE1",
"PRODUCT_ID":"3",
"PRODUCT_NAME":"MOTHERBOARD",
"GROUP_ID":"1",
"GROUP_NAME":"TESTGROUP1",
"GROUP_VALUE":"",
"TEST_ID":"10",
"TEST_NAME":"VOLTAGETEST",
"TEST_VALUE":"2.34523",
"TEST_STATUS":"OK",
"EXTRA_TEXT":""
}

  • Event 2 (Group 1, Test 20, No Extra Text):

{
"BATCH_TIMESTAMP":"2019-01-02T04:04:51.931+01:00",
"BATCH_SOFTWARE":"1.0",
"PLANT_NAME":"PARIS",
"PLANT_LINE":"PRODLINE1",
"PRODUCT_ID":"3",
"PRODUCT_NAME":"MOTHERBOARD",
"GROUP_ID":"1",
"GROUP_NAME":"TESTGROUP1",
"GROUP_VALUE":"",
"TEST_ID":"20",
"TEST_NAME":"INTEGRATION",
"TEST_VALUE":"1.00000",
"TEST_STATUS":"NOK",
"EXTRA_TEXT":""
}

  • Event 3 (Group 2, No Test, No Extra Text):

{
"BATCH_TIMESTAMP":"2019-01-02T04:04:51.931+01:00",
"BATCH_SOFTWARE":"1.0",
"PLANT_NAME":"PARIS",
"PLANT_LINE":"PRODLINE1",
"PRODUCT_ID":"3",
"PRODUCT_NAME":"MOTHERBOARD",
"GROUP_ID":"2",
"GROUP_NAME":"CHANGEOVER",
"GROUP_VALUE":"DONE",
"TEST_ID":"",
"TEST_NAME":"",
"TEST_VALUE":"",
"TEST_STATUS":"",
"EXTRA_TEXT":""
}

  • Event 4 (Group 3, No Test, Extra Text Present):

{
"BATCH_TIMESTAMP":"2019-01-02T04:04:51.931+01:00",
"BATCH_SOFTWARE":"1.0",
"PLANT_NAME":"PARIS",
"PLANT_LINE":"PRODLINE1",
"PRODUCT_ID":"3",
"PRODUCT_NAME":"MOTHERBOARD",
"GROUP_ID":"3",
"GROUP_NAME":"NOTIFICATION",
"GROUP_VALUE":"",
"TEST_ID":"",
"TEST_NAME":"",
"TEST_VALUE":"",
"TEST_STATUS":"",
"EXTRA_TEXT":"OK"
}

  • Event 5 (Group 4, Test 10, Extra Text Present):

{
"BATCH_TIMESTAMP":"2019-01-02T04:04:51.931+01:00",
"BATCH_SOFTWARE":"1.0",
"PLANT_NAME":"PARIS",
"PLANT_LINE":"PRODLINE1",
"PRODUCT_ID":"3",
"PRODUCT_NAME":"MOTHERBOARD",
"GROUP_ID":"4",
"GROUP_NAME":"VOLTAGE_NOTIF",
"GROUP_VALUE":"",
"TEST_ID":"10",
"TEST_NAME":"VOLTAGETEST",
"TEST_VALUE":"4.00001",
"TEST_STATUS":"NOK",
"EXTRA_TEXT":"OK"
}

I've been trying to do this with the xml and split filters, but without success so far. I'd appreciate any suggestions, especially on how to handle elements that are missing from the source, as in Event 3 (no tests, no extra text).
Thanks in advance!

I suggest you start with

    xml { source => "message" target => "theXML" }
    split { field => "[theXML][PRODUCT][0][GROUP]" }
    split { field => "[theXML][PRODUCT][0][GROUP][TEST]" }

and then add a whole bunch of mutate+rename filters.
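
To make that concrete, here is a rough, untested sketch of how those filters could fit together. It assumes the default force_array => true, the field names from the desired output in the first post, and that element text ends up under a "content" key (the xml filter's usual behaviour for elements that also carry attributes). The conditional around the second split is an addition: without it, groups that have no TEST would be tagged _split_type_failure, because split only accepts strings and arrays.

```
filter {
  xml { source => "message" target => "theXML" }

  # One event per GROUP element.
  split { field => "[theXML][PRODUCT][0][GROUP]" }

  # One event per TEST, but only for groups that contain any.
  if [theXML][PRODUCT][0][GROUP][TEST] {
    split { field => "[theXML][PRODUCT][0][GROUP][TEST]" }
  }

  # Flatten the nested structure into top-level fields.
  mutate {
    rename => {
      "[theXML][TIMESTAMP]"                          => "BATCH_TIMESTAMP"
      "[theXML][SOFTWARE]"                           => "BATCH_SOFTWARE"
      "[theXML][PLANT][0][NAME]"                     => "PLANT_NAME"
      "[theXML][PLANT][0][LINE]"                     => "PLANT_LINE"
      "[theXML][PRODUCT][0][ID]"                     => "PRODUCT_ID"
      "[theXML][PRODUCT][0][NAME]"                   => "PRODUCT_NAME"
      "[theXML][PRODUCT][0][GROUP][ID]"              => "GROUP_ID"
      "[theXML][PRODUCT][0][GROUP][NAME]"            => "GROUP_NAME"
      "[theXML][PRODUCT][0][GROUP][content]"         => "GROUP_VALUE"
      "[theXML][PRODUCT][0][GROUP][TEST][ID]"        => "TEST_ID"
      "[theXML][PRODUCT][0][GROUP][TEST][NAME]"      => "TEST_NAME"
      "[theXML][PRODUCT][0][GROUP][TEST][VALUE]"     => "TEST_VALUE"
      "[theXML][PRODUCT][0][GROUP][TEST][STATUS]"    => "TEST_STATUS"
      "[theXML][PRODUCT][0][GROUP][EXTRA][0][TEXT]"  => "EXTRA_TEXT"
    }
  }

  # Drop what is left of the parsed tree once everything is renamed.
  mutate { remove_field => [ "theXML" ] }
}
```

As far as I know, rename skips source fields that do not exist, so an event like Event 3 would simply lack the TEST_* and EXTRA_TEXT fields rather than carry empty strings.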

Hi @Badger,
Thanks. Could you tell me what the purpose of [0] is in the split filter?

The xml filter has a force_array option. If you do not set that to false, then [theXML][PRODUCT] will be an array, even though there is only one such element. The [0] references the first (and only) entry in the array.
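
For comparison, a minimal, untested sketch of the same first steps with force_array turned off, so that [0] disappears from the path. One caveat, stated as an assumption about your data: with force_array => false only elements that repeat are parsed as arrays, so a GROUP holding a single TEST (like group 4 in the sample) yields a hash rather than a one-element array, and a file with a single GROUP would do the same for GROUP itself. The shape then depends on the input, which is the main argument for keeping the default.

```
xml {
  source      => "message"
  target      => "theXML"
  force_array => false
}

# PRODUCT is now a plain hash, so no [0] is needed in the path.
# This split only works while the file contains more than one GROUP.
split { field => "[theXML][PRODUCT][GROUP]" }
```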

I see, thank you. I noticed that PRODUCT is an array, which is slightly inconvenient, since the easiest way to handle this data set is to have it flattened. Using this XML as an example, how can I make sure that each event is stored separately, but with no arrays? Instead, each value should be stored in the index as its own field.

Here's part of a current document in the Elasticsearch index:

theXML.PLANT {
  "LINE": "PRODLINE1",
  "NAME": "PARIS"
}
theXML.PRODUCT {
  "GROUP": {
    "TEST": {
      "VALUE": "2.34523",
      "NAME": "VOLTAGETEST",
      "STATUS": "OK",
      "ID": "10"
    },
    "NAME": "TESTGROUP1",
    "ID": "1"
  },
  "NAME": "MOTHERBOARD",
  "ID": "3"
}

[Screenshot: Discover - Kibana, 2019-04-18 15:11:33]

How could I achieve something like this?

{
theXML_PLANT_NAME: "PARIS",
theXML_PLANT_LINE: "PRODLINE1",
theXML_PRODUCT_ID: "3",
theXML_PRODUCT_NAME: "MOTHERBOARD",
theXML_GROUP_ID: "1",
theXML_GROUP_NAME: "TESTGROUP1",
theXML_TEST_ID: "10",
theXML_TEST_NAME: "VOLTAGETEST",
theXML_TEST_VALUE: "2.34523",
theXML_TEST_STATUS: "OK"
}

Edit:
I did it using mutate:

mutate {
  add_field => {
    "TEST_ID" => "%{[theXML][PRODUCT][0][GROUP][TEST][ID]}"
    "TEST_NAME" => "%{[theXML][PRODUCT][0][GROUP][TEST][NAME]}"
    "TEST_VALUE" => "%{[theXML][PRODUCT][0][GROUP][TEST][VALUE]}"
    "TEST_STATUS" => "%{[theXML][PRODUCT][0][GROUP][TEST][STATUS]}"
  }
}

Result in elastic:
[Screenshot: Discover - Kibana, 2019-04-18 15:51:12]
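
One thing to watch with add_field, sketched below and untested: when the referenced field does not exist (groups 2 and 3 have no TEST), the %{...} reference is not resolved and the literal text ends up in the new field. Wrapping the mutate in a conditional avoids that. The else branch, which writes empty strings to match the desired output in the first post, is an assumption about what is wanted and can be dropped.

```
if [theXML][PRODUCT][0][GROUP][TEST] {
  # TEST exists for this event, so the references resolve.
  mutate {
    add_field => {
      "TEST_ID"     => "%{[theXML][PRODUCT][0][GROUP][TEST][ID]}"
      "TEST_NAME"   => "%{[theXML][PRODUCT][0][GROUP][TEST][NAME]}"
      "TEST_VALUE"  => "%{[theXML][PRODUCT][0][GROUP][TEST][VALUE]}"
      "TEST_STATUS" => "%{[theXML][PRODUCT][0][GROUP][TEST][STATUS]}"
    }
  }
} else {
  # No TEST in this group: fill the fields with empty strings instead.
  mutate {
    add_field => {
      "TEST_ID"     => ""
      "TEST_NAME"   => ""
      "TEST_VALUE"  => ""
      "TEST_STATUS" => ""
    }
  }
}
```

The same pattern would apply to EXTRA_TEXT and GROUP_VALUE, each guarded by a check on its own source field.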

Thanks again @Badger.
