Can't analyze byte string ( b'{... \\u0633 ... }' )

First of all, sorry for the title. I don't know what I should call this (a byte string?).

I have data that looks like this:

b'{"id": "2", "words": ["\\u0633\\u0647\\u0627\\u0645\\u062f\\u0627\\u0631\\u06cc", "\\u0634\\u0631\\u0648\\u0639"], "content":  "#\\u0648\\u0644\\u0633\\u0627\\u067e\\u0627"}'

The data is coming from NSQ, and that's all I've been told (!). I don't know why it's in this format (they said it's because of the NSQ output).
Logstash config:

input { stdin {} }

filter {
    bytes{
        source => "message"
    }
}

output {
    elasticsearch {
        hosts => ["localhost:9200"]
        index => "mydata"
    }
}

Analyzer in Kibana:

PUT /text2
{
  "settings": {
    "analysis": {
      "char_filter": {
        "zero_width_spaces": {
            "type":       "mapping",
            "mappings": [ "\\u200C=>\\u0020"] 
        }
      },
      "filter": {
        "persian_stop": {
          "type":       "stop",
          "stopwords":  "_persian_" 
        }
      },
      "analyzer": {
        "rebuilt_persian": {
          "tokenizer":     "standard",
          "char_filter": [ "zero_width_spaces" ],
          "filter": [
            "asciifolding",
            "lowercase",
            "decimal_digit",
            "arabic_normalization",
            "persian_normalization",
            "persian_stop"
          ]
        }
      }
    }
  }
}
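(Aside: the analyzer can be exercised on its own with the `_analyze` API to confirm it tokenizes Persian text as expected — this is just a sanity check against the index created above:)

```
POST /text2/_analyze
{
  "analyzer": "rebuilt_persian",
  "text": "سهامداری شروع"
}
```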

Unfortunately, it doesn't decode "\u0633..." into the Persian alphabet.
But if instead, I set it like this:

filter {
    json {
        source => "message"
    }
}

and change the input to this:

{"id": "2", "words": ["\u0633\u0647\u0627\u0645\u062f\u0627\u0631\u06cc", "\u0634\u0631\u0648\u0639"], "content":  "#\u0648\u0644\u0633\u0627\u067e\u0627"}

it works fine.

I don't know what I'm doing wrong, and again, sorry for sounding like a total newbie.

The bytes filter does not do what you are hoping. It is used to parse size strings, changing (for example) "1Kb" into 1024.
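As an aside, that b'...' wrapper with the doubled backslashes is exactly what you get when a Python producer calls str() (or print) on a raw bytes payload instead of decoding it. That's only a guess about your upstream, but the shape is easy to reproduce:

```python
import json

# json.dumps escapes non-ASCII as \uXXXX by default (ensure_ascii=True)
doc = {"id": "2", "words": ["سهامداری", "شروع"]}
payload = json.dumps(doc).encode()

# Stringifying the bytes object adds the b'...' wrapper and displays
# every backslash doubled, which matches the input in the question
print(str(payload))

# Decoding instead yields clean JSON that the json filter can parse
print(payload.decode())
```

If that is what's happening, the real fix would be on the producer side (`payload.decode()` before publishing), but the mutate approach below works when you can't touch the producer.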

OK, so if the input line is not in the right format, you can fix it. If you configure your filters as

    mutate {
        gsub => [
            "message", "^b'", "",
            "message", "'$", "",
            "message", "([\\])[\\]", "\1"
         ]
    }
    json { source => "message" }

you will get

     "words" => [
    [0] "سهامداری",
    [1] "شروع"
],
   "content" => "#ولساپا",

and Google Translate confirms that at least the words array is in Persian.

The third line of the mutate, which changes \\ to \ in the string, may look strange, but I can explain. You are not required to understand, or even read, the explanation to get your filter to work.

In a logstash filter configuration, you cannot have \ at the end of a string, because the configuration compiler sees \" and assumes it is escaping the double quote to prevent it from ending the string. When replacing a backslash in a string, the standard trick is to use a character group that consists of a single backslash. Thus, to replace a single backslash with ! you would use

mutate { gsub => [ "message", "[\\]", "!" ] }

In your case I could have used that character group and specified that it should occur twice using "[\\]{2}", but then I would have had to use "\" in the replacement string, and we would again hit the problem that you cannot have a backslash at the end of a string. So instead I use two character groups and capture the first with () so that I can use the capture group in the replacement string (\1 refers to the first, and only, capture group).
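If you want to see the whole transformation outside Logstash, here is the same three-step cleanup sketched with Python's re module (purely illustrative; the patterns mirror the three gsub lines):

```python
import json
import re

# The raw message, exactly as it arrives (r-string keeps backslashes literal)
message = r"""b'{"id": "2", "words": ["\\u0633\\u0647\\u0627\\u0645\\u062f\\u0627\\u0631\\u06cc", "\\u0634\\u0631\\u0648\\u0639"], "content": "#\\u0648\\u0644\\u0633\\u0627\\u067e\\u0627"}'"""

message = re.sub(r"^b'", "", message)            # strip the leading b'
message = re.sub(r"'$", "", message)             # strip the trailing '
message = re.sub(r"([\\])[\\]", r"\1", message)  # collapse \\ into \

# Now it is valid JSON, and the parser decodes the \uXXXX escapes
doc = json.loads(message)
print(doc["words"])    # ['سهامداری', 'شروع']
print(doc["content"])  # -> #ولساپا
```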


Thank you.
I'm using

b'{"id": "2", "words": ["\\u0633\\u0647\\u0627\\u0645\\u062f\\u0627\\u0631\\u06cc", "\\u0634\\u0631\\u0648\\u0639"], "content": "#\\u0648\\u0644\\u0633\\u0627\\u067e\\u0627"}'

as input but getting this error

[2020-07-27T07:37:01,215][WARN ][logstash.filters.json    ][main][e98b3b561145c39a1a9da545e3d90085aaead237423e9fca2815d3498d8132b5] Error parsing json {:source=>"message", :raw=>"{\"id\": \"2\", \"words\": [\"\\u0633\\u0647\\u0627\\u0645\\u062f\\u0627\\u0631\\u06cc\", \"\\u0634\\u0631\\u0648\\u0639\"], \"content\":  \"#\\u0648\\u0644\\u0633\\u0627\\u067e\\u0627\"}'\r", :exception=>#<LogStash::Json::ParserError: Unexpected character (''' (code 39)): expected a valid value (number, String, array, object, 'true', 'false' or 'null')
 at [Source: (byte[])"{"id": "2", "words": ["\u0633\u0647\u0627\u0645\u062f\u0627\u0631\u06cc", "\u0634\u0631\u0648\u0639""; line: 1, column: 157]>}
{
       "message" => "{\"id\": \"2\", \"words\": [\"\\u0633\\u0647\\u0627\\u0645\\u062f\\u0627\\u0631\\u06cc\", \"\\u0634\\u0631\\u0648\\u0639\"], \"content\":  \"#\\u0648\\u0644\\u0633\\u0627\\u067e\\u0627\"}'\r",
      "@version" => "1",
          "host" => "DESKTOP-A",
          "tags" => [
        [0] "_jsonparsefailure"
    ],
    "@timestamp" => 2020-07-27T03:07:01.096Z
}

OK, so you have a \r at the end of the input message, which means "'$" will not match, because the ' is not at the end of the string. You could try replacing the middle line of the gsub with

"message", "'", "",
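For what it's worth, the reason the anchored pattern fails is that $ matches at the end of the string (or just before a final newline), and the stray \r sits between the ' and the end. Python's re engine behaves the same way here, so it makes a handy illustration:

```python
import re

message = "{\"content\": \"#x\"}'\r"

# '$ does not match: $ anchors at end of string (or before a final \n),
# and the carriage return sits after the quote
print(re.sub(r"'$", "", message) == message)   # True, nothing was removed

# Dropping the anchor removes the quote wherever it occurs
print(re.sub(r"'", "", message))
```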

Thanks, it works!
BTW, I didn't put that \r there myself! Where did it come from?!

\r is a carriage return, the first half of a Windows line ending (\r\n). Perhaps you are processing a file with Windows line endings on a UNIX server?


I'm using Windows, and both the Elasticsearch and Kibana servers are on Windows too (local), but that line came from Ubuntu. Maybe that's the reason.
Thanks!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.