Why stdout output shows message => "?"

Hello all,

I am very new to Logstash. I am just testing some filters to extract content from txt files without any structure. Why is this happening?

On a Windows environment.

Logstash config file:

input {

  file {

    path => "C:/test/*.txt"
    start_position => "beginning"
    codec => plain {
       charset => "UTF-16"
    }

  }
}

filter {
   
}

output {

  stdout {}

}

First text line of the file:

echo "first line text file..." >> file.txt

The message content in the log output is correct:

{
          "path" => "C:/test/file.txt",
          "message" => "first line text file...\r",
          "host" => "DESKTOP",
          "tags" => [
        [0] "_grokparsefailure"
    ],
    "@timestamp" => 2020-01-07T14:55:37.172Z,
      "@version" => "1"
}

A second line is added to the file (and subsequent ones):

echo "second line text file..." >> file.txt

Why does the message content in the log output show "?"

{
          "path" => "C:/test/file.txt",
          "message" => "??????????????????????????",
          "host" => "DESKTOP",
          "tags" => [
        [0] "_grokparsefailure"
    ],
    "@timestamp" => 2020-01-07T15:11:49.747Z,
      "@version" => "1"
}

Best regards,
nino

No way; new text only appears in the message field if there is just one line in the text file. If a new line is added, this is the log:

"message" => "?????????????????????"

My guess is that this is related to you setting the encoding to UTF-16, although I cannot think why that would only impact the second line. If you change the charset on the input to UTF-8, what happens?

Hi Badger,

codec => plain {
       charset => "UTF-8"
    }

Even worse

"message" => "\\xFF\\xFE1\\u0000 \\u0000l\\u0000i\\u0000n\\u0000e\\u0000 \\u0000t\\u0000e\\u0000x\\u0000t\\u0000 \\u0000f\\u0000i\\u0000l\\u0000e\\u0000.\\u0000.\\u0000.\\u0000\\r\\u0000"

Nothing, Badger, strange.

With UTF-16LE:
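
Presumably the codec was changed to something like this; the exact config was not posted in this reply, so this is an assumption based on the label above:

input {
  file {
    path => "C:/test/*.txt"
    start_position => "beginning"
    codec => plain {
      # assumed: explicit little-endian UTF-16
      charset => "UTF-16LE"
    }
  }
}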

{
       "message" => "?1 line text file...\r",
          "host" => "DESKTOP",
      "@version" => "1",
    "@timestamp" => 2020-01-07T20:25:44.995Z,
          "path" => "C:/test/file2.txt"
}
{
       "message" => "?????????????????????",
          "host" => "DESKTOP",
      "@version" => "1",
    "@timestamp" => 2020-01-07T20:26:10.794Z,
          "path" => "C:/test/file2.txt"
}

That looks good. It is little endian UTF-16 for "1 line text file...". What does the second event look like?

Thanks Badger, the second event should include something like:

cat  C:\test\file2.txt

1 line text file...
2 line text file...

"message" => "2 line text file..."

instead of "????????"

What does the second event look like when you use the UTF-8 charset?

With UTF-8:

{
          "host" => "DESKTOP",
      "@version" => "1",
    "@timestamp" => 2020-01-07T22:05:32.706Z,
       "message" => "\xFF\xFE1\u0000 \u0000l\u0000i\u0000n\u0000e\u0000 \u0000t\u0000e\u0000x\u0000t\u0000 \u0000f\u0000i\u0000l\u0000e\u0000.\u0000.\u0000.\u0000\r\u0000",
          "path" => "C:/test/file3.txt"
}
{
          "host" => "DESKTOP",
      "@version" => "1",
    "@timestamp" => 2020-01-07T22:05:39.982Z,
       "message" => "\u00002\u0000 \u0000l\u0000i\u0000n\u0000e\u0000 \u0000t\u0000e\u0000x\u0000t\u0000 \u0000f\u0000i\u0000l\u0000e\u0000.\u0000.\u0000.\u0000\r\u0000",
          "path" => "C:/test//file3.txt"
}

I am able to reproduce the issue on Linux. I create a file foo.txt that contains

1 line text file...
2 line text file...

I read this using

file {
    path => "/home/user/foo.txt"
    sincedb_path => "/dev/null"
    start_position => beginning
    codec => multiline {
        pattern => "^Spalanzani"
        negate => true
        what => previous
        auto_flush_interval => 1
        multiline_tag => ""
    }
}

and run it through

    ruby {
        code => '
            File.open("/home/user/foo2.txt", "wt", encoding: "UTF-16") do |f|
                f.puts event.get("message")
            end
        '
    }

The resulting file looks like this when I dump it using 'od -ha'.

0000000    fffe    3100    2000    6c00    6900    6e00    6500    2000
          ~ del nul   1 nul  sp nul   l nul   i nul   n nul   e nul  sp
0000020    7400    6500    7800    7400    2000    6600    6900    6c00
        nul   t nul   e nul   x nul   t nul  sp nul   f nul   i nul   l
0000040    6500    2e00    2e00    2e00    0a00    3200    2000    6c00
        nul   e nul   . nul   . nul   . nul  nl nul   2 nul  sp nul   l
0000060    6900    6e00    6500    2000    7400    6500    7800    7400
        nul   i nul   n nul   e nul  sp nul   t nul   e nul   x nul   t
0000100    2000    6600    6900    6c00    6500    2e00    2e00    2e00
        nul  sp nul   f nul   i nul   l nul   e nul   . nul   . nul   .
0000120    0a00

Which looks OK to me. I then read that using

file {
    path => "/home/user/foo2.txt"
    sincedb_path => "/dev/null"
    start_position => beginning
    codec => plain { charset => "UTF-16" }
}

and I get two events

   "message" => "1 line text file...�"
   "message" => "��������������������"

and I think that extra character suggests what the bug might be.

The problem is that the parsing into lines is done before the charset is applied. So the file input (on Linux) reads up to the first \n (41 bytes) and returns it, but it fails to consume the NUL that is the second 8 bits of the newline character. So now the rest of the file has effectively flipped from UTF-16BE to UTF-16LE, and none of it can be decoded.
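
One hypothetical way to see this for yourself (not something from this thread) is to read the file with the UTF-8 charset, where the raw bytes appear to pass through as in the output above, and copy the byte values of each message into a field with a ruby filter; the leftover NUL should then show up at the start of the second event:

filter {
  ruby {
    # Hypothetical diagnostic: copy the numeric byte values of the message
    # into a separate field so the leading NUL on later events is visible.
    code => 'event.set("message_bytes", event.get("message").bytes)'
  }
}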

The details may be slightly different on Windows, but the presence of that \r on your first event tells me it is not parsing line endings correctly.

Realistically that's not going to get fixed, so it should probably be documented.

A potential workaround would be to lie to the codec about the endianness, which might get you all except the first line, but I haven't tested that.
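
A sketch of that idea, untested, and assuming the Windows file really is little-endian UTF-16 (the \xFF\xFE at the start of the UTF-8 output above is the little-endian BOM): declare the opposite byte order, so that once the unconsumed NUL shifts the alignment, the remaining lines decode.

input {
  file {
    path => "C:/test/*.txt"
    start_position => "beginning"
    codec => plain {
      # Deliberately "wrong": the file starts as UTF-16LE, but after the
      # unconsumed NUL the remaining lines are effectively big-endian.
      charset => "UTF-16BE"
    }
  }
}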


Thank you so much, Badger, for the analysis. I'll try another output instead of stdout then; maybe output to a file and cat it.
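
For that, a minimal sketch using the file output plugin (which ships with Logstash), with a hypothetical output path:

output {
  file {
    # hypothetical path; adjust as needed
    path => "C:/test/logstash-out.log"
    # write only the message field, one line per event
    codec => line { format => "%{message}" }
  }
}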

Best regards

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.