Trying to parse XML with \r in the XML message field


(Lukas Bayard) #1

I am trying to read an XML file with Logstash, but the XML is only read up to the first \r. Shouldn't the multiline codec read everything into one event? If not, how can I exclude the message field?

XML File
<?xml version="1.0" encoding="utf-8"?><update xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet,  
consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,
sed diam voluptua.</message><origin>HR K</origin><status><type>5</type><timestamp>2018-06-18T01:08:58+02:00</timestamp></status><object><type>Location</type><address><street>street</street><streetnumber>101</streetnumber><zip>8888</zip><city>City</city></address><name>street 10</name><coords><lat>47.00000</lat><lon>8.00000</lon></coords></object><object><type>Destination</type><name>HAUPTGEBÄUDE</name></object></update>
My configuration looks like the following:
input
{
	file
	{
		path => "C:/Temp/SRZ/Probleme180618/test/*.xml"
		sincedb_path => "nul"
		start_position => "beginning"
		type => "xml"
		codec => multiline {
			pattern => "" 
			negate => "true"
			what => "previous"
		}
	}
}
filter
{
	xml
	{
		source => "message"
		store_xml => false
		target => "protocol"
        force_array => false
		xpath => [
			"//protocol/number/text()", "protocol_number",
			"//protocol/assignment/text()", "protocol_assignment",
			"//protocol/timestamp/text()", "protocol_timestamp",
			"//protocol/calltime/text()", "protocol_calltime",
			"//status/type/text()", "status_type"
		]
	}

}
output
{
	file {
		path => "C:/Temp/SRZ/output.txt"
		codec => line { format => "Nr: %{protocol_number} Assignment: %{protocol_assignment} Timestamp: %{protocol_timestamp} CallTime: %{protocol_calltime} StatusType: %{status_type}"}
	}
	stdout
	{
		codec => rubydebug
	}
} 
Logstash Output
{
      "@version" => "1",
          "path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
          "host" => "acec-lub01",
          "type" => "xml",
       "message" => "consetetur sadipscing elitr,\r",
    "@timestamp" => 2018-11-13T08:03:37.300Z
}
{
               "@version" => "1",
     "protocol_timestamp" => [
        [0] "2018-06-18T00:21:54+02:00"
    ],
                   "path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
                   "host" => "acec-lub01",
                   "type" => "xml",
        "protocol_number" => [
        [0] "S18085936"
    ],
                "message" => "<?xml version=\"1.0\" encoding=\"utf-8\"?><update xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet,  \r",
             "@timestamp" => 2018-11-13T08:03:37.253Z,
    "protocol_assignment" => [
        [0] "D18051009"
    ]
}

(Christian Dahlqvist) #2

That does not look like valid XML, so I am not surprised the XML filter does not work. I would recommend you either correct the input data or parse the data as text.


(Lukas Bayard) #3

Wouldn't it be possible to correct the XML in Logstash before filtering?


(Christian Dahlqvist) #4

I see that you updated the data and that it now looks like valid XML. Does the file contain a single XML document spread over multiple lines or can it contain more than one?


(Lukas Bayard) #5

The files always look the same as the "XML File" example, so yes, each one is a single XML document spread over multiple lines. As you can see, there are CRs inside the <message> tag.


(Walker) #6

Try changing your multiline pattern to <\?xml.* or <?xml (I don't remember whether it takes a regular expression or not). Then, before your XML filter, use the mutate filter's gsub option to remove the \r carriage return.
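
A minimal sketch of the codec part of this suggestion, assuming each file holds exactly one XML document that begins with the <?xml declaration. The multiline codec's pattern is a regular expression. With negate => true and what => "previous", every line that does not start with the declaration is appended to the event before it. The auto_flush_interval value of 2 seconds is an assumption for illustration; without it, the last document stays buffered until another line starting with <?xml arrives.

```
input {
  file {
    path => "C:/Temp/SRZ/Probleme180618/test/*.xml"
    sincedb_path => "nul"
    start_position => "beginning"
    type => "xml"
    codec => multiline {
      # A new event starts only at a line beginning with the XML declaration.
      pattern => "^<\?xml"
      # Invert the match: lines that do NOT start with "<?xml" ...
      negate => true
      # ... are appended to the previous line's event.
      what => "previous"
      # Assumed value: emit the pending event after 2 s without new lines.
      auto_flush_interval => 2
    }
  }
}
```

The codec joins the collected lines with \n, so for a file with CRLF line endings the assembled message still contains \r\n at each original line break.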


(Lukas Bayard) #7

Thank you for your response. I have tried that with the following configuration:

Logstash configuration
input
{
	file
	{
		path => "C:/Temp/SRZ/Probleme180618/test/*.xml"
		sincedb_path => "nul"
		start_position => "beginning"
		type => "xml"
		codec => multiline {
			pattern => "<\?xml.*" 
			negate => "false"
			what => "previous"
		}
	}
}
filter{
	mutate { 
		gsub => [ 
			"message", "[\r]", "",
			"message", "[\n]", ""
		] 
	}
}
filter
{
	xml
	{
		source => "message"
		store_xml => false
		target => "protocol"
        force_array => false
		xpath => [
			"//protocol/number/text()", "protocol_number",
			"//protocol/assignment/text()", "protocol_assignment",
			"//protocol/timestamp/text()", "protocol_timestamp",
			"//protocol/calltime/text()", "protocol_calltime",
			"//status/type/text()", "status_type"
		]
	}

}
output
{
	file {
		path => "C:/Temp/SRZ/output.txt"
		codec => line { format => "custom format: %{message}" }
	}
	stdout
	{
		codec => rubydebug
	}
} 

I got the following output in the console:

[2018-11-14T11:18:11,386][INFO ][logstash.outputs.file ] Opening file {:path=>"C:/Temp/SRZ/output.txt"}
{
    "type" => "xml",
    "protocol_timestamp" => [
        [0] "2018-06-18T00:21:54+02:00"
    ],
    "@version" => "1",
    "path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
    "@timestamp" => 2018-11-14T10:18:10.672Z,
    "host" => "acec-lub01",
    "protocol_number" => [
        [0] "S18085936"
    ],
    "protocol_assignment" => [
        [0] "D18051009"
    ],
    "message" => "<?xml version=\"1.0\" encoding=\"utf-8\"?><update xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet, "
}
{
    "type" => "xml",
    "@version" => "1",
    "path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
    "@timestamp" => 2018-11-14T10:18:10.714Z,
    "host" => "acec-lub01",
    "message" => "consetetur sadipscing elitr,"
}

and the following output in the file:

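custom format: <?xml version="1.0" encoding="utf-8"?><update xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet,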
custom format: consetetur sadipscing elitr,

So it looks like the mutate filter is not working or not configured as expected. Any hints or advice?


(Walker) #8

I'm thinking you don't want your gsub pattern in brackets.

filter{
	mutate { 
		gsub => [ 
			"message", "\r", "",
			"message", "\n", ""
		] 
	}
}

Also, once you get it working, you can combine them onto a single line as "message", "\r|\n", ""
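
Written out, the combined form might look like the sketch below. Filters only see an event after the codec has assembled it, so this gsub removes \r and \n inside a single event; it cannot rejoin lines that the multiline codec has already emitted as separate events.

```
filter {
  mutate {
    # One gsub entry is a triple: field name, regular expression, replacement.
    # "\r|\n" matches either a carriage return or a line feed.
    gsub => [ "message", "\r|\n", "" ]
  }
}
```

This block has to come before the xml filter in the configuration, since filters run in the order in which they are written.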


(Lukas Bayard) #9

OK, thanks. I have also changed what => "previous" to what => "next", and now the second line appears in the output as well, but not the remaining lines:

<?xml version="1.0" encoding="utf-8"?><update xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet, consetetur sadipscing elitr,

The other two lines are missing:
sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,
sed diam voluptua.</message>.....
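
One possible reading of this result, sketched as comments on the codec settings in use at this point (the pattern from post #7 with what changed to "next"). This is a fragment of the input block, not a complete configuration. The line numbers in the comments refer to the four lines of the "XML File" example. That lines 3 and 4 are emitted at all assumes the file ends with a newline.

```
codec => multiline {
  pattern => "<\?xml.*"
  negate => "false"
  what => "next"
  # Line 1 matches the pattern: it is held back and merged with line 2.
  # Line 2 does not match: the event (lines 1 + 2) ends here.
  # Lines 3 and 4 do not match: each becomes an event of its own
  # and is never appended to the XML document.
}
```

With negate => "false", only a line that matches the pattern is tied to its neighbour, and only the first line of the file matches. Inverting the match (negate => true with what => "previous" and a pattern anchored to the declaration) attaches every non-matching line to the first one instead.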


(Lukas Bayard) #10

Does anyone have any advice?