"delimiter" field behaving weirdly with \n char for the line codec with UTF-16

Hi, I’m trying to parse UTF-16-encoded (with BOM), CRLF-delimited CSV data from a TCP input, but I can’t get it to be parsed correctly.

In my .conf file:

input {
	tcp {
		port => 5008
		type => "csv"
		codec => line {
			charset => "UTF-16LE"
			delimiter => "\r\n"
		}
	}
}

output {
	stdout{ codec => rubydebug }
}

When I import the data with ncat localhost 5008 < mydata.csv, I only get one giant event.

That’s where I started investigating. Reading the doc for the "delimiter" field ( https://www.elastic.co/guide/en/logstash/7.1/plugins-codecs-line.html ), it says that

Default value is "\n"

When I don’t set the delimiter field, I have other problems, but at least it separates events correctly. However, when I put delimiter => "\n" (which is the default value, so it shouldn’t change the output), it parses everything as one big event.

This leads me to think that there might be a problem with the way that the "delimiter" field for the line codec handles \n and/or \r. Is there something I might have overlooked?

Unless you have config.support_escapes enabled

delimiter => "\n"

is not the default

delimiter => "
"

is the default. You have to use a literal newline character in the configuration file.
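In other words (a sketch based on the original config; only the delimiter handling is the point here), the escaped form only works once config.support_escapes: true is set in logstash.yml:

```
codec => line {
	charset => "UTF-16LE"
	# With config.support_escapes: true in logstash.yml, "\r\n" is
	# interpreted as CR+LF. Without it, this delimiter is the four
	# literal characters backslash, r, backslash, n.
	delimiter => "\r\n"
}
```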

That indeed seems to work for \n when I enable that setting.

However, it doesn’t work with \r\n. I know that my lines are CRLF:


But I still get one big message. Any idea why?

Thanks.

Using UTF-16 instead of UTF-16LE fixed this problem. EDIT: this is wrong, see my later messages.

Now, messages are separated correctly, but the content does not seem to be understood by Logstash.

Note that your data and your delimiters have opposite endianness. So if you get lines, all of your 16 bit characters have the bytes swapped. I believe it would be possible to fix that in a ruby filter.

I see the same endianness: the file starts with 2 bytes for the BOM, then little-endian data until 0D 00 0A 00, which is also little-endian. Is that right?

Yeah, I misread it.

Also, I made a mistake: changing to UTF-16 did not fix the problem. I still can’t get the \r\n delimiter to work.

How are you setting the delimiter? And what platform (Windows vs. UNIX) are you on?

Current input:

input {
	tcp {
		port => 5008
		type => "csv"
		codec => line {
			charset => "UTF-16"
			delimiter => "\r\n"
		}
	}
}

Current output: one big event. I enabled config.support_escapes: true in logstash.yml.

I’m on Ubuntu 18.04. The file comes from a Windows machine.

Looking at the code, the tokenizer knows nothing about the charset. So you need your delimiter to be '\r\0\n\0'.
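To illustrate why (plain Ruby, not Logstash code): since the tokenizer splits the raw byte stream before any charset conversion, the two-byte string "\r\n" can never match the four-byte UTF-16LE encoding of CRLF.

```ruby
# UTF-16LE encodes CR and LF as two bytes each, little-endian:
# CR -> 0D 00, LF -> 0A 00, so CRLF is the four bytes 0D 00 0A 00.
crlf_utf16 = "\r\n".encode("UTF-16LE")
p crlf_utf16.bytes  # => [13, 0, 10, 0]
```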

There is a PR open to add support for \0, but at the moment I think you are out of luck. I do not know if @yaauie plans to merge this in a future version.

You could get rid of the codec and tokenize it yourself in a ruby filter.
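A minimal sketch of what that manual tokenizing could look like (plain Ruby, not a complete filter config; the sample data and variable names are just for illustration): treat the payload as raw bytes, split on the UTF-16LE CRLF byte sequence, and re-encode each piece to UTF-8.

```ruby
# Sample UTF-16LE payload with CRLF line endings, viewed as raw bytes.
raw = "a,b\r\nc,d\r\n".encode("UTF-16LE").b

# The UTF-16LE byte sequence for CRLF: 0D 00 0A 00.
delimiter = "\r\x00\n\x00".b

# Split on the delimiter, then decode each chunk to UTF-8.
lines = raw.split(delimiter).map do |chunk|
  chunk.force_encoding("UTF-16LE").encode("UTF-8")
end

p lines  # => ["a,b", "c,d"]
```

Inside an actual ruby filter you would apply the same split/re-encode logic to the event's message bytes and emit one event per line.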

Alright I’ll take a look at that. Many thanks for your time!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.