I am trying to parse an XML file that is a few thousand lines long and looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<CONSOLIDATED_LIST xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="https://www.BALAHAALALALA.xsd" dateGenerated="2018-02-02T12:12:12.111-12:12">
<INDIVIDUALS>
<INDIVIDUAL>
<DATA>3333</DATA>
<VERSION>1</VERSION>
<FIRST_NAME>JO</FIRST_NAME>
<SECOND_NAME>SMITH</SECOND_NAME>
<THIRD_NAME />
<SOME_LIST_TYPE>AAA</SOME_LIST_TYPE>
<REFERENCE_NUMBER>111</REFERENCE_NUMBER>
<LISTED_ON>2014-01-01</LISTED_ON>
<COMMENTS>Hello</COMMENTS>
<NATIONALITY>
<VALUE>Hey</VALUE>
</NATIONALITY>
<LIST_TYPE>
<VALUE>some List</VALUE>
</LIST_TYPE>
<LAST_DAY_UPDATED>
<VALUE />
</LAST_DAY_UPDATED>
<NICKNAME>
<QUALITY />
<VALUE />
</NICKNAME>
<ADDRESS>
<COUNTRY />
</ADDRESS>
<DOB>
<DATE>1990-05-11</DATE>
</DOB>
<POB/>
</INDIVIDUAL>
<INDIVIDUAL>
<DATA>1111111</DATA>
<VERSION>2</VERSION>
<FIRST_NAME>BEEP</FIRST_NAME>
<SECOND_NAME>BOOP</SECOND_NAME>
<THIRD_NAME />
<SOME_LIST_TYPE>DJJ</SOME_LIST_TYPE>
<REFERENCE_NUMBER>4444</REFERENCE_NUMBER>
<LISTED_ON>2016-12-25</LISTED_ON>
<COMMENTS>ASDFJKLH DSGHJDSGH LAS GFOWE I .</COMMENTS>
<NATIONALITY>
<VALUE>ASDFS SDAF SDGSDAGSGDSG SDAG </VALUE>
</NATIONALITY>
<LIST_TYPE>
<VALUE>SOME List</VALUE>
</LIST_TYPE>
<LAST_DAY_UPDATED>
<VALUE />
</LAST_DAY_UPDATED>
<NICKNAME>
<QUALITY />
<VALUE />
</NICKNAME>
<ADDRESS>
<COUNTRY />
</ADDRESS>
<DOB>
<DATE>1990-01-01</DATE>
</DOB>
<POB/>
</INDIVIDUAL>
</INDIVIDUALS>
</CONSOLIDATED_LIST>
My input and filter sections look like this:
input {
file {
id => "Ingest"
path => ["c:/ELK-Stack/un/small.xml"]
start_position => "beginning"
ignore_older => 0
sincedb_path => "NUL"
codec => multiline {
pattern => "<CONSOLIDATED_LIST>"
negate => "true"
what => "previous"
}
}
}
filter {
xml {
id => "Parse"
store_xml => false
source => "message"
target => "xml_content"
force_array => true
xpath => [
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/DATAID/text()", "DATA",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/VERSIONNUM/text()", "VERSION",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/FIRST_NAME/text()", "FIRST_NAME",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/SECOND_NAME/text()", "SECOND_NAME",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/THIRD_NAME/text()", "THIRD_NAME",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/UN_LIST_TYPE/text()", "SOME_LIST_TYPE",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/REFERENCE_NUMBER/text()", "REFERENCE_NUMBER",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/COMMENTS1/text()", "COMMENTS",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/NATIONALITY/VALUE/text()", "NATIONALITY",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/LIST_TYPE/VALUE/text()", "LIST_TYPE",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/LAST_DAY_UPDATED/VALUE/text()", "LAST_DAY_UPDATED",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/NICKNAME/QUALITY/text()", "QUALITY",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/NICKNAME/ALIAS_NAME/text()", "ALIAS_NAME",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/ADDRESS/COUNTRY/text()", "COUNTRY",
"CONSOLIDATED_LIST/INDIVIDUALS/INDIVIDUAL/DOB/DATE/text()", "DATE",
"CONSOLIDATED_LIST/INDIVIDUALS/POB/text()", "INDIVIDUAL_PLACE_OF_BIRTH"
]
}
mutate {
remove_field => ['@timestamp', 'message', 'host', '@version', 'path']
}
}
The console output is as follows:
"NATIONALITY" => [
[0] "hey",
[1] "ASDFS SDAF SDGSDAGSGDSG SDAG "
],
"DATA" => [
[0] "3333",
[1] "1111111"
],
"SOME_LIST_TYPE" => [
[0] "AAA",
[1] "DJJ"
],
"COMMENTS" => [
[0] "Hello ."
],
"REFERENCE_NUMBER" => [
[0] "111",
[1] "4444"
],
"FIRST_NAME" => [
[0] "RI",
[1] "BEEP"
],
"tags" => [
[0] "multiline"
],
"SECOND_NAME" => [
[0] "WON HO",
[1] "BOOP"
],
"DATE" => [
[0] "1990-05-11",
[1] "1990-01-01"
],
"NICKNAME" => [
[0] "sup"
],
"VERSION" => [
[0] "1",
[1] "2"
],
"LIST_TYPE" => [
[0] "some List",
[1] "SOME List"
]
}
Instead, I would like the output to contain one event per INDIVIDUAL, something like this (field order doesn't matter, as in normal Logstash console output):
"Nationality" => "hey",
"DATA" => "3333",
"VERSION" => "1",
"FIRST_NAME" => "RI",
etc.
"Nationality" => "ASDFS SDAF SDGSDAGSGDSG SDAG ",
"DATA" => "1111111",
"VERSION" => "2",
"FIRST_NAME" => "BEEP",
I think my problem might be with the multiline codec, or that I'm missing a step after the xpath extraction, but I'm not sure how to proceed. I've tried setting the multiline codec pattern to "", but this doesn't help. I also can't seem to split my data, as I always get this error: (Only String and Array types are splittable. field:INDIVIDUALS/INDIVIDUAL/CONSOLIDATED_LIST is of type = NilClass). I also can't convert the type to a string.
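For reference, the per-INDIVIDUAL shape I'm after can be sketched outside Logstash. This is only a minimal Python illustration of the target output, not a Logstash solution; the trimmed sample, the `FIELDS` mapping and the `individuals_to_records` helper are made up for this example and are not part of my pipeline.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the sample file (assumption: same nesting as the real one).
SAMPLE = """<CONSOLIDATED_LIST>
  <INDIVIDUALS>
    <INDIVIDUAL>
      <DATA>3333</DATA>
      <VERSION>1</VERSION>
      <FIRST_NAME>JO</FIRST_NAME>
      <THIRD_NAME />
      <NATIONALITY><VALUE>Hey</VALUE></NATIONALITY>
    </INDIVIDUAL>
    <INDIVIDUAL>
      <DATA>1111111</DATA>
      <VERSION>2</VERSION>
      <FIRST_NAME>BEEP</FIRST_NAME>
      <THIRD_NAME />
      <NATIONALITY><VALUE>ASDFS</VALUE></NATIONALITY>
    </INDIVIDUAL>
  </INDIVIDUALS>
</CONSOLIDATED_LIST>"""

# Hypothetical mapping: output field name -> path relative to each INDIVIDUAL.
FIELDS = {
    "DATA": "DATA",
    "VERSION": "VERSION",
    "FIRST_NAME": "FIRST_NAME",
    "THIRD_NAME": "THIRD_NAME",
    "NATIONALITY": "NATIONALITY/VALUE",
}


def individuals_to_records(xml_text):
    """Return one flat dict per INDIVIDUAL instead of one array per field."""
    root = ET.fromstring(xml_text)
    records = []
    for individual in root.findall("./INDIVIDUALS/INDIVIDUAL"):
        record = {}
        for name, path in FIELDS.items():
            # findtext gives None for a missing element and "" for an empty one.
            text = individual.findtext(path)
            if text:
                record[name] = text
        records.append(record)
    return records


records = individuals_to_records(SAMPLE)
for record in records:
    print(record)
```

Because the paths are evaluated relative to each INDIVIDUAL, an empty or missing element only affects that one record, whereas in my current output a field like COMMENTS ends up with fewer entries than the others.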
I've had a look around the forum and tried various solutions, but none of them gives me the output I'm after. The config file posted here looks promising, but it doesn't work for me either (Parsing xml using logstaash xpath).
Any help would be much appreciated.