I am able to reproduce the issue on Linux. I create a file foo.txt that contains
1 line text file...
2 line text file...
I read this using
file {
  path => "/home/user/foo.txt"
  sincedb_path => "/dev/null"
  start_position => beginning
  codec => multiline {
    pattern => "^Spalanzani"
    negate => true
    what => previous
    auto_flush_interval => 1
    multiline_tag => ""
  }
}
and run it through
ruby {
  code => '
    File.open("/home/user/foo2.txt", "wt", encoding: "UTF-16") do |f|
      f.puts event.get("message")
    end
  '
}
The resulting file looks like this when I dump it using 'od -ha':
0000000 fffe 3100 2000 6c00 6900 6e00 6500 2000
~ del nul 1 nul sp nul l nul i nul n nul e nul sp
0000020 7400 6500 7800 7400 2000 6600 6900 6c00
nul t nul e nul x nul t nul sp nul f nul i nul l
0000040 6500 2e00 2e00 2e00 0a00 3200 2000 6c00
nul e nul . nul . nul . nul nl nul 2 nul sp nul l
0000060 6900 6e00 6500 2000 7400 6500 7800 7400
nul i nul n nul e nul sp nul t nul e nul x nul t
0000100 2000 6600 6900 6c00 6500 2e00 2e00 2e00
nul sp nul f nul i nul l nul e nul . nul . nul .
0000120 0a00
Which looks OK to me. I then read that using
file {
  path => "/home/user/foo2.txt"
  sincedb_path => "/dev/null"
  start_position => beginning
  codec => plain { charset => "UTF-16" }
}
and I get two events
"message" => "1 line text file...�"
"message" => "��������������������"
and I think that extra character suggests what the bug might be.
The problem is that the parsing into lines is done before the charset is applied. The file input (on Linux) reads up to the first \n (41 bytes) and returns that as a line, but it does not treat the NUL that is the other 8 bits of the newline character as part of the line ending. So the rest of the file has effectively flipped from UTF-16BE to UTF-16LE, and none of it can be decoded.
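To illustrate, here is a minimal plain-Ruby sketch (not Logstash's actual reader code) that rebuilds the bytes from the dump above and splits them on the single byte 0x0A. That split is my assumption about what a byte-oriented line reader effectively does, and the variable names are made up for the example.

```ruby
# Rebuild the bytes shown in the od dump: BOM FE FF followed by UTF-16BE text.
text  = "1 line text file...\n2 line text file...\n"
bytes = "\xFE\xFF".b + text.encode("UTF-16BE").b

# Split on the single byte 0x0A, ignoring the fact that a UTF-16 newline
# is two bytes (00 0A in big-endian order).
chunks = bytes.split("\n".b)

chunks.each_with_index do |chunk, i|
  last = format("%02x", chunk.bytes.last)
  puts "chunk #{i}: #{chunk.bytesize} bytes, odd length: #{chunk.bytesize.odd?}, last byte 0x#{last}"
end
```

Both chunks come out with an odd number of bytes and end in a NUL, so neither is a whole number of 16-bit code units, and the second chunk no longer starts with a BOM. The first chunk is 41 bytes, which matches the count above.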
The details may be slightly different on Windows, but the presence of that \r on your first event tells me it is not parsing line endings correctly.
Realistically that's not going to get fixed, so it should probably be documented.
A potential workaround would be to lie to the codec about the endianness, which might get you all except the first line, but I haven't tested that.
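A rough plain-Ruby sketch of why that might work, assuming the source file is little-endian with CRLF line endings (a guess on my part; the sample data and names are hypothetical, and this is not how the codec itself is implemented). After a byte-wise split on 0x0A, every chunk after the first starts with the NUL left over from the newline, so its bytes pair up as big-endian code units. The sketch drops the final odd byte by hand; I don't know what the plain codec would do with it.

```ruby
# Hypothetical little-endian input with CRLF line endings (an assumption).
text  = "ab\r\ncd\r\n"
bytes = "\xFF\xFE".b + text.encode("UTF-16LE").b

# Byte-oriented split on 0x0A, as a reader unaware of UTF-16 might do it.
chunks = bytes.split("\n".b)

# First chunk: skip the 2-byte BOM and decode with the true byte order.
first = chunks[0].byteslice(2..-1).force_encoding("UTF-16LE").encode("UTF-8")

# Second chunk: it begins with the NUL left over from the previous newline,
# so the bytes now pair up as big-endian. Drop the trailing odd byte, then
# decode with the "wrong" endianness.
shifted = chunks[1].byteslice(0, chunks[1].bytesize - 1)
second  = shifted.force_encoding("UTF-16BE").encode("UTF-8")

puts first.inspect
puts second.inspect
```

Note that both decoded strings still carry the \r, since only the 0x0A byte was treated as a line ending.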