Can't analyze byte string ( b'{... \\u0633 ... }' )

First of all, sorry for the title. I don't know what I should call this (a byte string?).

I have data that looks like this:

b'{"id": "2", "words": ["\\u0633\\u0647\\u0627\\u0645\\u062f\\u0627\\u0631\\u06cc", "\\u0634\\u0631\\u0648\\u0639"], "content":  "#\\u0648\\u0644\\u0633\\u0627\\u067e\\u0627"}'

The data is coming from NSQ, and that's all I've been told (!). I don't know why it's in this format (they said it's because of the NSQ output).
Logstash config:

input { stdin {} }

filter {
    bytes{
        source => "message"
    }
}

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "mydata"
    }
}

Analyzer in Kibana:

PUT /text2
{
  "settings": {
    "analysis": {
      "char_filter": {
        "zero_width_spaces": {
            "type":       "mapping",
            "mappings": [ "\\u200C=>\\u0020"] 
        }
      },
      "filter": {
        "persian_stop": {
          "type":       "stop",
          "stopwords":  "_persian_" 
        }
      },
      "analyzer": {
        "rebuilt_persian": {
          "tokenizer":     "standard",
          "char_filter": [ "zero_width_spaces" ],
          "filter": [
            "asciifolding",
            "lowercase",
            "decimal_digit",
            "arabic_normalization",
            "persian_normalization",
            "persian_stop"
          ]
        }
      }
    }
  }
}
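(Aside: the analyzer can be exercised on its own with the `_analyze` API to confirm it tokenizes Persian text as expected — this is just a sanity check against the index created above:)

```
POST /text2/_analyze
{
  "analyzer": "rebuilt_persian",
  "text": "سهامداری شروع"
}
```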

Unfortunately, it doesn't decode "\u0633..." into the Persian alphabet.
But if instead, I set it like this:

filter {
    json {
        source => "message"
    }
}

and change the input to this:

{"id": "2", "words": ["\u0633\u0647\u0627\u0645\u062f\u0627\u0631\u06cc", "\u0634\u0631\u0648\u0639"], "content":  "#\u0648\u0644\u0633\u0627\u067e\u0627"}

it works fine.

I don't know what I'm doing wrong, and again, sorry for sounding like a total newbie.

The bytes filter does not do what you are hoping. It is used to parse size strings, changing (for example) "1Kb" into 1024.
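As an aside, that b'...' wrapper with the doubled backslashes is exactly what you get when a Python producer calls str() (or print) on a raw bytes payload instead of decoding it. That's only a guess about your upstream, but the shape is easy to reproduce:

```python
import json

# json.dumps escapes non-ASCII as \uXXXX by default (ensure_ascii=True)
doc = {"id": "2", "words": ["سهامداری", "شروع"]}
payload = json.dumps(doc).encode()

# Stringifying the bytes object adds the b'...' wrapper and displays
# every backslash doubled, which matches the input in the question
print(str(payload))

# Decoding instead yields clean JSON that the json filter can parse
print(payload.decode())
```

If that is what's happening, the real fix would be on the producer side (`payload.decode()` before publishing), but the mutate approach below works when you can't touch the producer.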

OK, so if the input line is not in the right format, you can fix it. If you configure your filters as

    mutate {
        gsub => [
            "message", "^b'", "",
            "message", "'$", "",
            "message", "([\\])[\\]", "\1"
         ]
    }
    json { source => "message" }

you will get

     "words" => [
    [0] "سهامداری",
    [1] "شروع"
],
   "content" => "#ولساپا",

and Google Translate confirms that at least the words array is in Persian.

The third line of the mutate, which changes \\ to \ in the string, may look strange, but I can explain. You are not required to understand, or even read, the explanation to get your filter to work.

In a logstash filter configuration, you cannot have \ at the end of a string, because the configuration compiler sees \" and assumes it is escaping the double quote to prevent it from ending the string. When replacing a backslash in a string, the standard trick is to use a character group that consists of a single backslash. Thus, to replace a single backslash with ! you would use

mutate { gsub => [ "message", "[\\]", "!" ] }

In your case I could have used that character group and specified that it should occur twice using "[\\]{2}", but then I would have had to use "\" in the replacement string, and we would again hit the problem that you cannot have a backslash at the end of a string. So instead I use two character groups and capture the first with () so that I can use the capture group in the replacement string (\1 refers to the first, and only, capture group).
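If you want to see the whole transformation outside Logstash, here is the same three-step cleanup sketched with Python's re module (purely illustrative; the patterns mirror the three gsub lines):

```python
import json
import re

# The raw message, exactly as it arrives (r-string keeps backslashes literal)
message = r"""b'{"id": "2", "words": ["\\u0633\\u0647\\u0627\\u0645\\u062f\\u0627\\u0631\\u06cc", "\\u0634\\u0631\\u0648\\u0639"], "content": "#\\u0648\\u0644\\u0633\\u0627\\u067e\\u0627"}'"""

message = re.sub(r"^b'", "", message)            # strip the leading b'
message = re.sub(r"'$", "", message)             # strip the trailing '
message = re.sub(r"([\\])[\\]", r"\1", message)  # collapse \\ into \

# Now it is valid JSON, and the parser decodes the \uXXXX escapes
doc = json.loads(message)
print(doc["words"])    # ['سهامداری', 'شروع']
print(doc["content"])  # -> #ولساپا
```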


Thank you.
I'm using

b'{"id": "2", "words": ["\\u0633\\u0647\\u0627\\u0645\\u062f\\u0627\\u0631\\u06cc", "\\u0634\\u0631\\u0648\\u0639"], "content": "#\\u0648\\u0644\\u0633\\u0627\\u067e\\u0627"}'

as input but getting this error

[2020-07-27T07:37:01,215][WARN ][logstash.filters.json    ][main][e98b3b561145c39a1a9da545e3d90085aaead237423e9fca2815d3498d8132b5] Error parsing json {:source=>"message", :raw=>"{\"id\": \"2\", \"words\": [\"\\u0633\\u0647\\u0627\\u0645\\u062f\\u0627\\u0631\\u06cc\", \"\\u0634\\u0631\\u0648\\u0639\"], \"content\":  \"#\\u0648\\u0644\\u0633\\u0627\\u067e\\u0627\"}'\r", :exception=>#<LogStash::Json::ParserError: Unexpected character (''' (code 39)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: (byte[])"{"id": "2", "words": ["\u0633\u0647\u0627\u0645\u062f\u0627\u0631\u06cc", "\u0634\u0631\u0648\u0639""; line: 1, column: 157]>}
{
       "message" => "{\"id\": \"2\", \"words\": [\"\\u0633\\u0647\\u0627\\u0645\\u062f\\u0627\\u0631\\u06cc\", \"\\u0634\\u0631\\u0648\\u0639\"], \"content\":  \"#\\u0648\\u0644\\u0633\\u0627\\u067e\\u0627\"}'\r",
      "@version" => "1",
          "host" => "DESKTOP-A",
          "tags" => [
        [0] "_jsonparsefailure"
    ],
    "@timestamp" => 2020-07-27T03:07:01.096Z
}

OK, so you have a \r at the end of the input message, which means "'$" will not match, because the ' is not at the end of the string. You could try replacing the middle line of the gsub with

"message", "'", "",
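For what it's worth, the reason the anchored pattern fails is that $ matches at the end of the string (or just before a final newline), and the stray \r sits between the ' and the end. Python's re engine behaves the same way here, so it makes a handy illustration:

```python
import re

message = "{\"content\": \"#x\"}'\r"

# '$ does not match: $ anchors at end of string (or before a final \n),
# and the carriage return sits after the quote
print(re.sub(r"'$", "", message) == message)   # True, nothing was removed

# Dropping the anchor removes the quote wherever it occurs
print(re.sub(r"'", "", message))
```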

Thanks, it works!
BTW, I didn't put that \r there myself! Where did it come from?!

\r is a carriage return, the first half of a Windows line ending (\r\n). Perhaps you are processing a file with Windows line endings on a UNIX server?


I'm using Windows, and both the Elasticsearch and Kibana servers are on Windows too (local), but that line came from Ubuntu. Maybe that's the reason.
Thanks!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.