Trying to parse XML with \r in the XML message field

I am trying to read an XML file with Logstash, but the XML is only read up to the first \r. Shouldn't the multiline codec read everything? Or how can I exclude that message field?

XML File
<?xml version="1.0" encoding="utf-8"?><update xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet,  
consetetur sadipscing elitr,
sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,
sed diam voluptua.</message><origin>HR K</origin><status><type>5</type><timestamp>2018-06-18T01:08:58+02:00</timestamp></status><object><type>Location</type><address><street>street</street><streetnumber>101</streetnumber><zip>8888</zip><city>City</city></address><name>street 10</name><coords><lat>47.00000</lat><lon>8.00000</lon></coords></object><object><type>Destination</type><name>HAUPTGEBÄUDE</name></object></update>
My configuration looks like this:
input
{
	file
	{
		path => "C:/Temp/SRZ/Probleme180618/test/*.xml"
		sincedb_path => "nul"
		start_position => "beginning"
		type => "xml"
		codec => multiline {
			pattern => "" 
			negate => "true"
			what => "previous"
		}
	}
}
filter
{
	xml
	{
		source => "message"
		store_xml => false
		target => "protocol"
		force_array => false
		xpath => [
			"//protocol/number/text()", "protocol_number",
			"//protocol/assignment/text()", "protocol_assignment",
			"//protocol/timestamp/text()", "protocol_timestamp",
			"//protocol/calltime/text()", "protocol_calltime",
			"//status/type/text()", "status_type"
		]
	}

}
output
{
	file {
		path => "C:/Temp/SRZ/output.txt"
		codec => line { format => "Nr: %{protocol_number} Assignment: %{protocol_assignment} Timestamp: %{protocol_timestamp} CallTime: %{protocol_calltime} StatusType: %{status_type}"}
	}
	stdout
	{
		codec => rubydebug
	}
} 
Logstash Output
{
      "@version" => "1",
          "path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
          "host" => "acec-lub01",
          "type" => "xml",
       "message" => "consetetur sadipscing elitr,\r",
    "@timestamp" => 2018-11-13T08:03:37.300Z
}
{
               "@version" => "1",
     "protocol_timestamp" => [
        [0] "2018-06-18T00:21:54+02:00"
    ],
                   "path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
                   "host" => "acec-lub01",
                   "type" => "xml",
        "protocol_number" => [
        [0] "S18085936"
    ],
                "message" => "<?xml version=\"1.0\" encoding=\"utf-8\"?><update xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet,  \r",
             "@timestamp" => 2018-11-13T08:03:37.253Z,
    "protocol_assignment" => [
        [0] "D18051009"
    ]
}

That does not look like valid XML, so I am not surprised the XML filter does not work. I would recommend you either correct the input data or parse the data as text.

Wouldn't it be possible to correct the XML in Logstash before filtering?

I see that you updated the data and that it now looks like valid XML. Does the file contain a single XML document spread over multiple lines or can it contain more than one?

The files always look like the "XML File" example above, so yes, each one is a single XML document spread over multiple lines. As you can see, there are CRs inside the <message> tag.

Try changing your multiline pattern to <\?xml.* or <?xml (Don't remember if it takes regular expression or not). Afterwards, before your XML filter, use the mutate filter's gsub function to remove the \r carriage return.

Thank you for your response. I have tried that with the following configuration:

Logstash configuration
input
{
	file
	{
		path => "C:/Temp/SRZ/Probleme180618/test/*.xml"
		sincedb_path => "nul"
		start_position => "beginning"
		type => "xml"
		codec => multiline {
			pattern => "<\?xml.*" 
			negate => "false"
			what => "previous"
		}
	}
}
filter{
	mutate { 
		gsub => [ 
			"message", "[\r]", "",
			"message", "[\n]", ""
		] 
	}
}
filter
{
	xml
	{
		source => "message"
		store_xml => false
		target => "protocol"
		force_array => false
		xpath => [
			"//protocol/number/text()", "protocol_number",
			"//protocol/assignment/text()", "protocol_assignment",
			"//protocol/timestamp/text()", "protocol_timestamp",
			"//protocol/calltime/text()", "protocol_calltime",
			"//status/type/text()", "status_type"
		]
	}

}
output
{
	file {
		path => "C:/Temp/SRZ/output.txt"
		codec => line { format => "custom format: %{message}" }
	}
	stdout
	{
		codec => rubydebug
	}
} 

I got the following output in the console:

[2018-11-14T11:18:11,386][INFO ][logstash.outputs.file ] Opening file {:path=>"C:/Temp/SRZ/output.txt"}
{
"type" => "xml",
"protocol_timestamp" => [
[0] "2018-06-18T00:21:54+02:00"
],
"@version" => "1",
"path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
"@timestamp" => 2018-11-14T10:18:10.672Z,
"host" => "acec-lub01",
"protocol_number" => [
[0] "S18085936"
],
"protocol_assignment" => [
[0] "D18051009"
],
"message" => "<?xml version=\"1.0\" encoding=\"utf-8\"?><update xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance\" xmlns:xsd="http://www.w3.org/2001/XMLSchema\">2018S18085936D180510091062018-06-18T00:21:54+02:001Lorem ipsum dolor sit amet, "
}
{
"type" => "xml",
"@version" => "1",
"path" => "C:/Temp/SRZ/Probleme180618/test/20180618010858_739_[SendUpdate]__T_200.xml",
"@timestamp" => 2018-11-14T10:18:10.714Z,
"host" => "acec-lub01",
"message" => "consetetur sadipscing elitr,"
}

and the following output in the file:

custom format: <?xml version="1.0" encoding="utf-8"?>2018S18085936D180510091062018-06-18T00:21:54+02:001Lorem ipsum dolor sit amet,
custom format: consetetur sadipscing elitr,

So it looks like the mutate filter is not working or not configured as expected. Any hints or advice?

I'm thinking you don't want your gsub pattern in brackets.

filter{
	mutate { 
		gsub => [ 
			"message", "\r", "",
			"message", "\n", ""
		] 
	}
}

Also, once you get it working, you can combine them onto a single line as "message", "\r|\n", ""
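
A minimal sketch of the combined form, assuming the goal is simply to flatten the line breaks in "message" before the xml filter runs. It uses a character class, which here is equivalent to the \r|\n alternation, with a + so that a \r\n pair is handled as one run. Replacing each run with a single space instead of an empty string is an assumption on my part, made so that words on adjacent lines are not glued together.

```
filter {
  mutate {
    # gsub takes triples: field name, regular expression, replacement.
    # [\r\n]+ matches any run of carriage returns and line feeds.
    gsub => [ "message", "[\r\n]+", " " ]
  }
}
```

Keep in mind that gsub only works on the event it receives. It cannot pull back lines that the multiline codec has already emitted as separate events, so the grouping has to be right in the input first.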

OK, thanks. I have also changed what => "previous" to what => "next", and now the second line also appears in the output, but the remaining lines do not:

<?xml version="1.0" encoding="utf-8"?><update xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema"><protocol><year>2018</year><number>S18085936</number><assignment>D18051009</assignment><pager>106</pager><timestamp>2018-06-18T00:21:54+02:00</timestamp></protocol><keyword>1</keyword><message>Lorem ipsum dolor sit amet, consetetur sadipscing elitr,

The other two lines are missing:
sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat,
sed diam voluptua.</message>.....

Does anyone have any advice?
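
One possible reading of that result, offered as a sketch rather than a confirmed fix: with negate => "false" and what => "next", only a line that matches the pattern is attached to the line after it. Only the first line contains <?xml, so it is joined to the second line, and the remaining lines become events of their own. The usual idiom for "everything up to the next XML declaration is one event" is the inverse: treat every line that does not start with the declaration as a continuation of the previous line. The auto_flush_interval value below is an assumption; it is there so that the last document in a file is emitted even though no further <?xml line arrives to close it.

```
input {
  file {
    path => "C:/Temp/SRZ/Probleme180618/test/*.xml"
    sincedb_path => "nul"
    start_position => "beginning"
    type => "xml"
    codec => multiline {
      # The pattern is a regular expression; ^ anchors it to the line start.
      pattern => "^<\?xml"
      # negate => true: act on the lines that do NOT match the pattern ...
      negate => true
      # ... and append each of them to the previous line.
      what => "previous"
      # Assumed value: flush a pending event after 2 seconds without new lines.
      auto_flush_interval => 2
    }
  }
}
```

The codec joins the lines with newlines, so the resulting "message" still contains the \r characters. XML parsers generally accept line breaks inside element text, so the xml filter should be able to read the joined document; stripping \r with mutate is then mostly a matter of cleaning up the extracted text. If the documents are very long, the codec's max_lines and max_bytes limits may also need raising.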
