Logstash parsing XML failing on second and subsequent records

Hi, I'm struggling to understand why Logstash cannot process any records beyond the first in my XML log file. The first record parses fine, but every following one gets the "_xmlparsefailure" tag.

My log file is in this format:

<?xml version="1.0" encoding="utf-16"?><Log><DateTime>2022-07-04T10:40:49.2352382+01:00</DateTime><Level>Information</Level><ServiceGUID>6d62412d-7a7e-4087-9e0c-38f665474839</ServiceGUID></Log>&#xD;
<?xml version="1.0" encoding="utf-16"?><Log><DateTime>2022-07-04T10:40:49.2382446+01:00</DateTime><Level>Information</Level><ServiceGUID>6d62412d-7a7e-4087-9e0c-38f665474839</ServiceGUID></Log>&#xD;
<?xml version="1.0" encoding="utf-16"?><Log><DateTime>2022-07-04T10:40:59.3376527+01:00</DateTime><Level>Error</Level><ServiceGUID>72d0d523-662c-4545-899f-571f3969a441</ServiceGUID></Log>&#xD;

and this is my Logstash config file:

input {
	file {
		path => "C:/temp/Log_Files/*.xml"
		start_position => "beginning"
		sincedb_path => "NUL"
		codec => plain {
			charset => "UTF-16"
		}
	}
}
filter {
	xml {
		source => "message"
		target => "parsed"
	}
}
output {
	stdout {
		codec => rubydebug
	}
}

I've tried the same log file without the &#xD; text at the end of each line, and that doesn't seem to make any difference (I'm not sure what that text is anyway).

The output from the first record looks like this:

{
        "parsed" => {
              "Level" => [
            [0] "Information"
        ],
           "DateTime" => [
            [0] "2022-07-04T10:40:49.2352382+01:00"
        ],
        "ServiceGUID" => [
            [0] "6d62412d-7a7e-4087-9e0c-38f665474839"
        ]
    },
          "host" => {
        "name" => "HOSTNAME"
    },
    "@timestamp" => 2022-07-05T12:13:07.202461500Z,
      "@version" => "1",
           "log" => {
        "file" => {
            "path" => "C:/temp/Log_Files/2022-07-04T00.00#2022-07-05T00.00.xml"
        }
    },
       "message" => "<?xml version=\"1.0\" encoding=\"utf-16\"?><Log><DateTime>2
022-07-04T10:40:49.2352382+01:00</DateTime><Level>Information</Level><ServiceGUI
D>6d62412d-7a7e-4087-9e0c-38f665474839</ServiceGUID></Log>&#xD;\r",
         "event" => {
        "original" => "<?xml version=\"1.0\" encoding=\"utf-16\"?><Log><DateTime
>2022-07-04T10:40:49.2352382+01:00</DateTime><Level>Information</Level><ServiceG
UID>6d62412d-7a7e-4087-9e0c-38f665474839</ServiceGUID></Log>&#xD;\r"
    }
}

but the output from the second and subsequent records looks like this:

{
          "host" => {
        "name" => "HOSTNAME"
    },
    "@timestamp" => 2022-07-05T12:13:07.210459300Z,
          "tags" => [
        [0] "_xmlparsefailure"
    ],
      "@version" => "1",
           "log" => {
        "file" => {
            "path" => "C:/temp/Log_Files/2022-07-04T00.00#2022-07-05T00.00.xml"
        }
    },
       "message" => "???????????????????????????????????????????????????????????
????????????????????????????????????????????????????????????????????????????????
??????????????????????????????????????????????????????",
         "event" => {
        "original" => "?????????????????????????????????????????????????????????
????????????????????????????????????????????????????????????????????????????????
????????????????????????????????????????????????????????"
    }
}

Anyone able to shed any light on what is missing or wrong?
Thank you.

I suspect you are hitting this issue. The file input reads lines, and it splits the data into lines before the codec (and its charset) is applied, so it cannot properly process 16-bit characters.
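
To illustrate why only the first record survives, here is a minimal Python sketch of the effect, not Logstash's actual code. It assumes the file is little-endian UTF-16 with "\r\n" line endings, and the sample records are made up. In that encoding a newline is the two bytes 0A 00, so splitting on the single byte 0A leaves the stray 00 at the start of the next line and shifts every later line by one byte.

```python
# Two made-up records, encoded as little-endian UTF-16 (an assumption
# about the log file) and separated by "\r\n".
text = "<Log>one</Log>\r\n<Log>two</Log>\r\n"
raw = text.encode("utf-16-le")

# Split on the single newline byte, as a byte-oriented line reader would,
# before any charset decoding has happened.
chunks = raw.split(b"\n")


def decode(chunk: bytes) -> str:
    # Decode one chunk on its own, substituting U+FFFD for broken input.
    return chunk.decode("utf-16-le", errors="replace")


first = decode(chunks[0])   # starts on a character boundary
second = decode(chunks[1])  # starts one byte late, so every pair is shifted

print(repr(first))
print(repr(second))
```

Under these assumptions the first chunk decodes to the original record followed by "\r", which matches the trailing "\r" in the first event above. The second chunk decodes to unrelated characters, which is likely what the console is showing as question marks.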

Thanks Badger. I'll go back to our developers and see if we can get a different log output.
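
If changing the application's output is not possible, one possible workaround is to re-encode each file as UTF-8 before Logstash reads it. The sketch below is only an assumption about how that could be done: the function name transcode_utf16_to_utf8 and both paths are hypothetical, and it relies on the source file starting with a byte-order mark. The XML declaration in each record would still say encoding="utf-16" afterwards, and this thread does not confirm how the xml filter treats that.

```python
from pathlib import Path


def transcode_utf16_to_utf8(src: Path, dst: Path) -> int:
    """Copy src to dst, re-encoding UTF-16 as UTF-8.

    Returns the number of lines written.
    """
    count = 0
    # The "utf-16" codec reads the byte-order mark to choose the endianness.
    # newline="" keeps the original line endings untouched on both sides.
    with open(src, "r", encoding="utf-16", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
            count += 1
    return count
```

The converted copy could then be read by the file input with its default UTF-8 charset, since a newline in UTF-8 is a single byte and a byte-level line split no longer cuts a character in half.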
