"delimiter" field behaving weirdly with \n char for the line codec with UTF-16

res260 · June 18, 2019, 7:48pm

Hi, I’m trying to parse CSV data from a TCP input. The data is UTF-16 encoded (with a BOM) and uses CRLF line endings, but I can’t get it to parse correctly.

In my .conf file:

input {
	tcp {
		port => 5008
		type => "csv"
		codec => line {
			charset => "UTF-16LE"
			delimiter => "\r\n"
		}
	}
}

output {
	stdout{ codec => rubydebug }
}

When I import the data with ncat localhost 5008 < mydata.csv, I only get one giant event.

That’s where I started investigating. The documentation for the "delimiter" field ( https://www.elastic.co/guide/en/logstash/7.1/plugins-codecs-line.html ) says:

Default value is "\n"

When I don’t set the delimiter field, I have other problems, but at least it separates events correctly. However, when I set delimiter => "\n" (which is the documented default, so it shouldn’t change the output), everything is parsed as one big event.

This leads me to think that there might be a problem with the way that the "delimiter" field for the line codec handles \n and/or \r. Is there something I might have overlooked?

Badger · June 18, 2019, 7:55pm

Unless you have config.support_escapes enabled

delimiter => "\n"

is not the default

delimiter => "
"

is the default. You use a literal newline in the configuration file.
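
The difference between the two forms can be illustrated outside Logstash. The sketch below uses Ruby's single-quoted strings as a stand-in for a config value read with escape support disabled. It is only an analogy, not Logstash's own config parser, and the sample data is made up.

```ruby
# Single-quoted Ruby strings keep the backslash, much like a config value
# read with escape support disabled (an analogy, not Logstash's parser).
unescaped = '\n'
escaped   = "\n"

puts unescaped.bytes.inspect  # [92, 110]: a backslash followed by "n"
puts escaped.bytes.inspect    # [10]: a single line feed

# Data that ends lines with a real line feed never contains the
# two-character sequence, so splitting on it leaves one giant piece.
data = "a,b\nc,d\n"
puts data.split(unescaped).length  # 1
puts data.split(escaped).length    # 2
```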

res260 · June 18, 2019, 8:22pm

That indeed seems to work for the \n when I enable this config.

However, it doesn’t work with \r\n. I know that my lines are CRLF.

But I still get one big message. Any idea why?

Thanks.

res260 · June 18, 2019, 8:30pm

Using UTF-16 instead of UTF-16LE fixed this problem. EDIT: this is wrong, see my later messages.

Now, messages are separated correctly, but the content does not seem to be understood by Logstash.

Badger · June 18, 2019, 8:38pm

Note that your data and your delimiters have opposite endianness. So if you get lines, all of your 16-bit characters have their bytes swapped. I believe it would be possible to fix that in a ruby filter.

res260 · June 18, 2019, 8:44pm

I see the same endianness: the data starts with the 2 bytes of the BOM, then is little-endian up to 0D 00 0A 00, which is also little-endian. Is that right?
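
One way to double-check the byte order is to look at the first two bytes of the file. The sketch below is a minimal example that assumes the input really is UTF-16 (FF FE is also the start of a UTF-32LE BOM). The helper name utf16_byte_order and the sample string are hypothetical.

```ruby
# Hypothetical helper: report the byte order implied by a UTF-16 BOM.
# Assumes the input is UTF-16; it does not try to rule out UTF-32.
def utf16_byte_order(bytes)
  head = bytes.b.byteslice(0, 2)
  case head
  when "\xFF\xFE".b then :little_endian
  when "\xFE\xFF".b then :big_endian
  else :unknown
  end
end

# Made-up sample: a BOM, the letter "a", then CRLF, encoded as UTF-16LE.
sample = "\uFEFFa\r\n".encode("UTF-16LE")
hex = sample.bytes.map { |b| format("%02X", b) }.join(" ")

puts hex                       # FF FE 61 00 0D 00 0A 00
puts utf16_byte_order(sample)  # little_endian
```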

Badger · June 18, 2019, 8:46pm

Yeah, I misread it.

res260 · June 18, 2019, 8:46pm

Also, I made a mistake: changing to UTF-16 did not fix the problem. I still can’t get the \r\n delimiter to work.

Badger · June 18, 2019, 8:49pm

How are you setting the delimiter? And what platform (Windows vs. UNIX) are you on?

res260 · June 18, 2019, 8:50pm

Current input:

input {
	tcp {
		port => 5008
		type => "csv"
		codec => line {
			charset => "UTF-16"
			delimiter => "\r\n"
		}
	}
}

Current output: one big event. I enabled config.support_escapes: true in logstash.yml.

res260 · June 18, 2019, 8:53pm

I’m on Ubuntu 18.04. The file comes from a Windows machine.

Badger · June 18, 2019, 9:04pm

Looking at the code, the tokenizer knows nothing about the charset. So you need your delimiter to be '\r\0\n\0'.
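
The byte-level mismatch can be reproduced in plain Ruby. This is a minimal sketch with made-up sample data. It only mimics a tokenizer that compares raw bytes and does not call any Logstash code.

```ruby
# Made-up CSV sample, encoded as UTF-16LE and then viewed as raw bytes.
utf16le = "a,b\r\nc,d\r\n".encode("UTF-16LE").b

# In UTF-16LE, CR LF is 0D 00 0A 00, so the two-byte sequence 0D 0A
# never appears in the stream.
puts utf16le.include?("\r\n".b)      # false
puts utf16le.include?("\r\0\n\0".b)  # true

# Splitting on the four-byte delimiter yields pieces that can then be
# decoded as UTF-16LE.
lines = utf16le.split("\r\0\n\0".b).map do |piece|
  piece.force_encoding("UTF-16LE").encode("UTF-8")
end
p lines  # ["a,b", "c,d"]
```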

There is a PR open to add support for \0, but at the moment I think you are out of luck. I do not know if @yaauie plans to merge this in a future version.

You could get rid of the codec and tokenize it yourself in a ruby filter.
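
A rough sketch of what such tokenizing could look like in plain Ruby follows. The class name Utf16LeLineBuffer and its feed method are hypothetical and not part of any Logstash API. The sketch assumes little-endian input with CRLF line endings, and wiring it into a ruby filter is left out.

```ruby
# Hypothetical line buffer for UTF-16LE data with CRLF line endings.
# It works on raw bytes, so a chunk may end in the middle of a character.
class Utf16LeLineBuffer
  DELIM = "\r\0\n\0".b  # CR LF encoded as UTF-16LE
  BOM   = "\xFF\xFE".b  # UTF-16LE byte order mark

  def initialize
    @buf = String.new(encoding: Encoding::BINARY)
    @bom_checked = false
  end

  # Append a chunk of bytes and return the complete lines as UTF-8 strings.
  def feed(chunk)
    @buf << chunk.b
    strip_bom
    lines = []
    pos = 0
    while (idx = @buf.index(DELIM, pos))
      if idx.odd?
        # A match at an odd offset straddles two characters; skip it.
        pos = idx + 1
        next
      end
      line = @buf.byteslice(0, idx)
      @buf = @buf.byteslice(idx + DELIM.bytesize, @buf.bytesize)
      lines << line.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
      pos = 0
    end
    lines
  end

  private

  # Drop a leading BOM once at least two bytes have arrived.
  def strip_bom
    return if @bom_checked || @buf.bytesize < 2
    @buf = @buf.byteslice(2, @buf.bytesize) if @buf.start_with?(BOM)
    @bom_checked = true
  end
end
```

Bytes after the last delimiter stay in the buffer until a later chunk completes the line, so data that does not end with CRLF would need an extra flush step.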

res260 · June 18, 2019, 9:21pm

Alright, I’ll take a look at that. Many thanks for your time!

