Why stdout output shows message => "?"

I am able to reproduce the issue on Linux. I create a file foo.txt that contains

1 line text file...
2 line text file...

I read this using

file {
    path => "/home/user/foo.txt"
    sincedb_path => "/dev/null"
    start_position => beginning
    codec => multiline {
        pattern => "^Spalanzani"
        negate => true
        what => previous
        auto_flush_interval => 1
        multiline_tag => ""
    }
}

and run it through

    ruby {
        code => '
            File.open("/home/user/foo2.txt", "wt", encoding: "UTF-16") do |f|
                f.puts event.get("message")
            end
        '
    }

the resulting file looks like this when I dump it using 'od -ha'.

0000000    fffe    3100    2000    6c00    6900    6e00    6500    2000
          ~ del nul   1 nul  sp nul   l nul   i nul   n nul   e nul  sp
0000020    7400    6500    7800    7400    2000    6600    6900    6c00
        nul   t nul   e nul   x nul   t nul  sp nul   f nul   i nul   l
0000040    6500    2e00    2e00    2e00    0a00    3200    2000    6c00
        nul   e nul   . nul   . nul   . nul  nl nul   2 nul  sp nul   l
0000060    6900    6e00    6500    2000    7400    6500    7800    7400
        nul   i nul   n nul   e nul  sp nul   t nul   e nul   x nul   t
0000100    2000    6600    6900    6c00    6500    2e00    2e00    2e00
        nul  sp nul   f nul   i nul   l nul   e nul   . nul   . nul   .
0000120    0a00

Which looks OK to me. I then read that using

file {
    path => "/home/user/foo2.txt"
    sincedb_path => "/dev/null"
    start_position => beginning
    codec => plain { charset => "UTF-16" }
}

and I get two events

   "message" => "1 line text file...�"
   "message" => "��������������������"

and I think that extra character suggests what the bug might be.

The problem is that the input is split into lines before the charset is applied. So the file input (on Linux) reads up to the first \n byte (41 bytes) and returns that as the line. But in UTF-16 a newline is two bytes, one of them a NUL, and that NUL is not consumed along with the \n. So now the rest of the file has effectively flipped from UTF-16BE to UTF-16LE, and none of it can be decoded.
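
To make the byte counts concrete, here is a minimal sketch in plain Ruby of what a byte-oriented split on \n does to this file. It is an illustration only, not Logstash's actual code path: the variable names are made up, and it assumes the reader splits on the single byte 0x0a before any decoding happens.

```ruby
# Hypothetical reproduction in plain Ruby; not Logstash's actual code path.
text = "1 line text file...\n2 line text file...\n"

# Ruby's "UTF-16" encoder emits a big-endian BOM (fe ff) and then UTF-16BE,
# which matches the od dump above. .b returns a binary copy of the bytes.
bytes = text.encode("UTF-16").b

# Split on the single byte 0x0a, as a byte-oriented line reader would.
chunks = bytes.split("\n".b)
first, second = chunks

puts bytes.bytesize      # total size of the encoded file
puts first.bytesize      # BOM + 19 characters + the NUL half of the newline
puts second.bytesize     # 19 characters + the NUL half of the newline, no BOM
p first.bytes.first(2)   # the BOM stays with the first chunk only
p first.bytes.last       # trailing byte left over from the two-byte newline
p second.bytes.first(2)  # the second chunk starts straight at the "2"
```

This prints 82, 41, 39, [254, 255], 0 and [0, 50]. Both chunks have an odd number of bytes, so neither is a whole number of 16-bit code units, and only the first one carries the BOM that a plain "UTF-16" charset relies on to pick a byte order.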

The details may be slightly different on Windows, but the presence of that \r on your first event tells me it is not parsing line endings correctly.

Realistically that's not going to get fixed, so it should probably be documented.

A potential workaround would be to lie to the codec about the endianness, which might get you all except the first line, but I haven't tested that.
