Trying to parse xml with \r in the xml message field

lukas.bayard · November 13, 2018, 8:09am

I am trying to read an XML file with Logstash, but the XML is only read up to the first \r. Shouldn't the multiline codec read everything? Or how can I exclude that message field?

XML File

<?xml version="1.0" encoding="utf-8"?><update xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet,  
consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,
sed diam voluptua.</message><origin>HR K</origin><status><type>5</type><timestamp>2018-06-18T01:08:58+02:00</timestamp></status><object><type>Location</type><address><street>street</street><streetnumber>101</streetnumber><zip>8888</zip><city>City</city></address><name>street 10</name><coords><lat>47.00000</lat><lon>8.00000</lon></coords></object><object><type>Destination</type><name>HAUPTGEBÄUDE</name></object></update>

My configuration looks like this:

input
{
	file
	{
		path => "C:/Temp/SRZ/Probleme180618/test/*.xml"
		sincedb_path => "nul"
		start_position => "beginning"
		type => "xml"
		codec => multiline {
			pattern => "" 
			negate => "true"
			what => "previous"
		}
	}
}
filter
{
	xml
	{
		source => "message"
		store_xml => false
		target => "protocol"
		force_array => false
		xpath => [
			"//protocol/number/text()", "protocol_number",
			"//protocol/assignment/text()", "protocol_assignment",
			"//protocol/timestamp/text()", "protocol_timestamp",
			"//protocol/calltime/text()", "protocol_calltime",
			"//status/type/text()", "status_type"
		]
	}

}
output
{
	file {
		path => "C:/Temp/SRZ/output.txt"
		codec => line { format => "Nr: %{protocol_number} Assignment: %{protocol_assignment} Timestamp: %{protocol_timestamp} CallTime: %{protocol_calltime} StatusType: %{status_type}"}
	}
	stdout
	{
		codec => rubydebug
	}
}

Logstash Output

{
      "@version" => "1",
          "path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
          "host" => "acec-lub01",
          "type" => "xml",
       "message" => "consetetur sadipscing elitr,\r",
    "@timestamp" => 2018-11-13T08:03:37.300Z
}
{
               "@version" => "1",
     "protocol_timestamp" => [
        [0] "2018-06-18T00:21:54+02:00"
    ],
                   "path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
                   "host" => "acec-lub01",
                   "type" => "xml",
        "protocol_number" => [
        [0] "S18085936"
    ],
                "message" => "<?xml version=\"1.0\" encoding=\"utf-8\"?><update xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet,  \r",
             "@timestamp" => 2018-11-13T08:03:37.253Z,
    "protocol_assignment" => [
        [0] "D18051009"
    ]
}

Christian_Dahlqvist · November 13, 2018, 8:13am

That does not look like valid XML, so I am not surprised the XML filter does not work. I would recommend you either correct the input data or parse the data as text.

lukas.bayard · November 13, 2018, 8:15am

Wouldn't it be possible to correct the XML in Logstash before filtering?

Christian_Dahlqvist · November 13, 2018, 10:07am

I see that you updated the data and that it now looks like valid XML. Does the file contain a single XML document spread over multiple lines or can it contain more than one?

lukas.bayard · November 13, 2018, 10:22am

The files always look like the "XML File" example, so yes, each file is a single XML document spread over multiple lines. As you can see, there are CR characters inside the <message> element.

wwalker · November 13, 2018, 8:40pm

Try changing your multiline pattern to <\?xml.* or <?xml (I don't remember whether it takes a regular expression or not). Then, before your XML filter, use the mutate filter's gsub option to remove the \r carriage returns.

lukas.bayard · November 14, 2018, 10:25am

Thank you for your response. I have tried that with the following configuration:

Logstash configuration

input
{
	file
	{
		path => "C:/Temp/SRZ/Probleme180618/test/*.xml"
		sincedb_path => "nul"
		start_position => "beginning"
		type => "xml"
		codec => multiline {
			pattern => "<\?xml.*" 
			negate => "false"
			what => "previous"
		}
	}
}
filter{
	mutate { 
		gsub => [ 
			"message", "[\r]", "",
			"message", "[\n]", ""
		] 
	}
}
filter
{
	xml
	{
		source => "message"
		store_xml => false
		target => "protocol"
		force_array => false
		xpath => [
			"//protocol/number/text()", "protocol_number",
			"//protocol/assignment/text()", "protocol_assignment",
			"//protocol/timestamp/text()", "protocol_timestamp",
			"//protocol/calltime/text()", "protocol_calltime",
			"//status/type/text()", "status_type"
		]
	}

}
output
{
	file {
		path => "C:/Temp/SRZ/output.txt"
		codec => line { format => "custom format: %{message}" }
	}
	stdout
	{
		codec => rubydebug
	}
}

I got the following output in the console:

[2018-11-14T11:18:11,386][INFO ][logstash.outputs.file ] Opening file {:path=>"C:/Temp/SRZ/output.txt"}
{
                   "type" => "xml",
     "protocol_timestamp" => [
        [0] "2018-06-18T00:21:54+02:00"
    ],
               "@version" => "1",
                   "path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
             "@timestamp" => 2018-11-14T10:18:10.672Z,
                   "host" => "acec-lub01",
        "protocol_number" => [
        [0] "S18085936"
    ],
    "protocol_assignment" => [
        [0] "D18051009"
    ],
                "message" => "<?xml version=\"1.0\" encoding=\"utf-8\"?><update xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet,  "
}
{
          "type" => "xml",
      "@version" => "1",
          "path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
    "@timestamp" => 2018-11-14T10:18:10.714Z,
          "host" => "acec-lub01",
       "message" => "consetetur sadipscing elitr,"
}

and the following output in the file:

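custom format: <?xml version="1.0" encoding="utf-8"?><update xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet,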
custom format: consetetur sadipscing elitr,

So it looks like the mutate filter is not working, or is not configured as expected. Any hints or advice?

wwalker · November 16, 2018, 1:28am

I'm thinking you don't want your gsub pattern in brackets.

filter{
	mutate { 
		gsub => [ 
			"message", "\r", "",
			"message", "\n", ""
		] 
	}
}

Also, once you get it working, you can combine them onto a single line as "message", "\r|\n", ""
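
As a minimal sketch of that single-entry form: each gsub entry is a triple of field name, regular expression, and replacement. Note that in the console output earlier in the thread the trailing \r had already disappeared with the bracketed patterns, so the brackets may not have been what caused the split events. The variant below is an assumption rather than something proposed in the thread: it replaces each run of CR/LF characters with a single space instead of deleting it, so that words at line ends are not glued together.

```
filter {
	mutate {
		# gsub takes triples: field, regular expression, replacement.
		# Assumption: a space is used instead of "" so that
		# "elitr," and "sed" remain separate words after joining.
		gsub => [ "message", "[\r\n]+", " " ]
	}
}
```

This block would need to come before the xml filter so that the cleaned message is what gets parsed.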

lukas.bayard · November 16, 2018, 7:44am

OK, thanks. I have also changed what => "previous" to what => "next", and now the second line also appears in the output, but not the remaining lines:

<?xml version="1.0" encoding="utf-8"?><update xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet, consetetur sadipscing elitr,

The other two lines are missing:
sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,
sed diam voluptua.</message>.....
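
One possible explanation for this result, based on how the multiline codec's options are documented: with negate => "false" and what => "next", only lines that match the pattern are attached to the line that follows them. Here only the first line matches <\?xml.*, so lines 1 and 2 are merged while lines 3 and 4 are left on their own. The sketch below shows the inverse setting. It assumes that each file holds exactly one document and that the document starts with <?xml; the auto_flush_interval value of 2 seconds is an arbitrary choice, not a value from the thread.

```
input {
	file {
		path => "C:/Temp/SRZ/Probleme180618/test/*.xml"
		sincedb_path => "nul"
		start_position => "beginning"
		type => "xml"
		codec => multiline {
			# The pattern is a regular expression, anchored at line start.
			pattern => "^<\?xml"
			# negate => true: act on lines that do NOT match the pattern ...
			negate => true
			# ... and append each of them to the previous line.
			what => "previous"
			# Assumption: emit the buffered event after 2 seconds without
			# new lines, since no second "<?xml" line arrives to close it.
			auto_flush_interval => 2
		}
	}
}
```

With these settings every line that does not start with <?xml is appended to the event in progress, so the whole document should arrive at the filters as one message. If the last line of the file has no trailing newline, the file input may still hold that line back; the sketch does not address that case.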

lukas.bayard · November 27, 2018, 4:47pm

Does anyone have any advice?

system · December 25, 2018, 4:47pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
