Ruby logstash filter

I have a message that contains Unicode escape sequences.

I want to convert it to UTF-8 characters in my country's language (Vietnamese).

The input comes from a Filebeat filestream input.

I use Logstash to parse the message:

\u0043\u1ea3\u006d\u0020\u01a1\u006e\u0020\u0071\u0075\u00fd\u0020\u006b\u0068\u00e1\u0063\u0068

Expected result:

"Cảm ơn quý khách"

I wrote a simple Ruby script and tested it, and it works:

require 'uri'
message = "\u0043\u1ea3\u006d\u0020\u01a1\u006e\u0020\u0071\u0075\u00fd\u0020\u006b\u0068\u00e1\u0063\u0068"
enc_uri = URI.decode_www_form_component(message)
p enc_uri

But when I put it into a ruby filter in Logstash and print the result with puts for testing, it does not work:

filter {
    ruby {
        init => "require 'uri'"
        code => "
        @enc_uri = enc_uri = URI.decode_www_form_component(event.get('message'))
        puts @enc_uri
        "
    }
}

Unexpected results:

## This line, expected: `"Cảm ơn quý khách"`
\u0043\u1ea3\u006d\u0020\u01a1\u006e\u0020\u0071\u0075\u00fd\u0020\u006b\u0068\u00e1\u0063\u0068
{
       "message" => "\\u0043\\u1ea3\\u006d\\u0020\\u01a1\\u006e\\u0020\\u0071\\u0075\\u00fd\\u0020\\u006b\\u0068\\u00e1\\u0063\\u0068",
         "event" => {
        "original" => "\\u0043\\u1ea3\\u006d\\u0020\\u01a1\\u006e\\u0020\\u0071\\u0075\\u00fd\\u0020\\u006b\\u0068\\u00e1\\u0063\\u0068"
    },
           "ecs" => {
        "version" => "8.0.0"
    },
         "input" => {
        "type" => "filestream"
    },
         "agent" => {
                "type" => "filebeat",
        "ephemeral_id" => "69ccd3be-66c2-45ab-8ac8-e585698c7a0a",
                "name" => "2285d6af9a56",
             "version" => "8.5.2",
                  "id" => "a009634c-6ee6-487b-8d2b-87cf5c0cd7ec"
    },
          "host" => {
        "name" => "2285d6af9a56"
    },
      "@version" => "1",
           "log" => {
          "file" => {
            "path" => "/var/log/test/api.log"
        },
          "type" => "api",
        "offset" => 63342
    },
    "@timestamp" => 2022-12-03T08:33:53.754Z,
           "biz" => true,
          "tags" => [
        [0] "beats_input_codec_plain_applied"
    ]
}

Please help me understand this and how to make it work.

If you use a configuration that creates the [message] field with a json codec:

    input { generator { count => 1 lines => [ '{ "message": "\u0043\u1ea3\u006d\u0020\u01a1\u006e\u0020\u0071\u0075\u00fd\u0020\u006b\u0068\u00e1\u0063\u0068" }' ] codec => json } }

then you will get

   "message" => "Cảm ơn quý khách",

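This works because \uXXXX is a standard JSON string escape, so the JSON parser decodes it for you. A minimal sketch in plain Ruby, outside Logstash, assuming only the standard json library; the variable names are illustrative and not part of any Logstash API:

```ruby
require 'json'

# Single quotes keep the backslashes literal, just like a line read from a log file.
raw = '{ "message": "\u0043\u1ea3\u006d\u0020\u01a1\u006e\u0020\u0071\u0075\u00fd\u0020\u006b\u0068\u00e1\u0063\u0068" }'

# The JSON parser turns each \uXXXX escape into the matching character.
decoded = JSON.parse(raw)["message"]
puts decoded
```

So if the log line can be shipped as JSON, no ruby filter is needed at all.
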
With a configuration like

    input { generator { count => 1 lines => [ '\u0043\u1ea3\u006d\u0020\u01a1\u006e\u0020\u0071\u0075\u00fd\u0020\u006b\u0068\u00e1\u0063\u0068' ] } }

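the problem is that the backslashes are literal characters in the field (rubydebug prints them escaped, so you see \\u0043...). That's not URI encoding, which uses %XX sequences, so URI.decode_www_form_component leaves the text unchanged.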
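
The standalone script in the question appeared to work for a different reason: Ruby's double-quoted string literal had already turned the \u escapes into characters before URI.decode_www_form_component ever saw them. A small sketch in plain Ruby showing the difference (the variable names are illustrative):

```ruby
require 'uri'

# Double quotes: Ruby interprets \uXXXX while parsing the source code.
literal = "\u0043\u1ea3\u006d"
# Single quotes: the backslashes stay in the string, which is what Logstash receives.
from_log = '\u0043\u1ea3\u006d'

puts literal.length   # 3 characters
puts from_log.length  # 18 characters: three runs of backslash, u, four hex digits

# decode_www_form_component only handles %XX sequences and "+",
# so both strings come back unchanged.
puts URI.decode_www_form_component(literal) == literal
puts URI.decode_www_form_component(from_log) == from_log
```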

What we can do is walk through the message field looking for \u followed by four hex digits, convert those four hex digits into a 16-bit integer in network byte order, and then encode that code point as a UTF-8 character (I copied it from an SO answer):

    ruby {
        code => '
            event.set("someField", event.get("message").gsub(/\\u([\da-fA-F]{4})/) {|x| [$1].pack("H*").unpack("n*").pack("U*")})
        '
    }

results in

 "someField" => "Cảm ơn quý khách"

And yes, I could have overwritten [message] in the event.set call.
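
To see what each step of that pack/unpack chain does, here is the same conversion for a single escape in plain Ruby, outside Logstash. The `decode_escapes` helper name is hypothetical, and the shorter `$1.hex` variant is an equivalent alternative offered as a sketch, not something taken from the SO answer:

```ruby
hex = "1ea3"                      # the four hex digits captured from \u1ea3

bytes = [hex].pack("H*")          # hex string -> two raw bytes, 0x1E 0xA3
code_points = bytes.unpack("n*")  # two bytes -> one 16-bit big-endian integer
char = code_points.pack("U*")     # code point -> UTF-8 encoded character

p code_points  # [7843]
puts char      # ả

# Hypothetical helper doing the same thing with String#hex instead of pack/unpack.
def decode_escapes(text)
  text.gsub(/\\u([\da-fA-F]{4})/) { [$1.hex].pack("U") }
end

puts decode_escapes('\u0043\u1ea3\u006d\u0020\u01a1\u006e')
```

Either form only looks at four hex digits at a time, so it assumes every escape stands for a single code point up to U+FFFF.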

Note that since we are doing gsub on parts of the string that are \u plus four hex characters, other unencoded text before, after, or in between those parts is unaffected. So a message field containing

He responded "\u0043\u1ea3...

would result in

He responded "Cảm ơn...
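
A quick way to check that behaviour outside Logstash is to run the same gsub in plain Ruby. The sample line below is made up for illustration; note that a backslash-u that is not followed by four hex digits, as in the Windows-style path, is left alone:

```ruby
# The same gsub as in the filter, applied to a string with plain text around the escapes.
line = 'He responded "\u0043\u1ea3m \u01a1n" (see C:\users\temp)'

decoded = line.gsub(/\\u([\da-fA-F]{4})/) { |x| [$1].pack("H*").unpack("n*").pack("U*") }
puts decoded
```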