Kuromoji: Tokenization of ゴロンと is Unexpected (incorrect?)


#1

Is the behavior below a bug, or is it expected for Kuromoji? It occurs with the default settings and even on the Kuromoji demo page.

We are using Kuromoji configured as follows:

    "ja_tokenizer": {
        "type": "kuromoji_tokenizer",
        "mode": "search",
        "discard_punctuation": "false"
    }

Tokenization for ゴロンと is:

{
"tokens": [
    {
        "token": "ゴロ",
        "start_offset": 0,
        "end_offset": 2,
        "type": "word",
        "position": 0
    },
    {
        "token": "ン",
        "start_offset": 2,
        "end_offset": 3,
        "type": "word",
        "position": 1
    },
    {
        "token": "と",
        "start_offset": 3,
        "end_offset": 4,
        "type": "word",
        "position": 2
    }
]
}

The expected tokenization is:

{
"tokens": [
    {
        "token": "ゴロン",
        "start_offset": 0,
        "end_offset": 3,
        "type": "word",
        "position": 0
    },
    {
        "token": "と",
        "start_offset": 3,
        "end_offset": 4,
        "type": "word",
        "position": 1
    }
]
}


(Jun Ohtani) #2

That is expected behavior. I think it depends on the dictionary.

For a search use case, you can use the nbest feature of kuromoji_tokenizer or a user_dictionary.
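As a rough illustration of the user dictionary route, the index settings below are a minimal sketch, not a tested configuration. It assumes an Elasticsearch version that accepts inline `user_dictionary_rules` on `kuromoji_tokenizer`, and the single rule for ゴロン (surface form, segmentation, reading, part-of-speech tag) is a hypothetical entry written for this example. The tokenizer name `ja_tokenizer` and the other options are taken from the question.

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ja_tokenizer": {
          "type": "kuromoji_tokenizer",
          "mode": "search",
          "discard_punctuation": "false",
          "user_dictionary_rules": [
            "ゴロン,ゴロン,ゴロン,カスタム名詞"
          ]
        }
      }
    }
  }
}
```

With an entry like this, ゴロン would be expected to come out as a single token, but it is worth confirming with the `_analyze` API against your own index. The nbest route is configured on the same tokenizer through `nbest_cost` or `nbest_examples`; it emits additional candidate segmentations rather than replacing the default one, so whether it produces ゴロン still depends on what the dictionary contains.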


#3

Many thanks for the quick reply. We're using Neologd. I'm not seeing that specific entry in the dictionary. Will consider nbest and other options then. Thank you!

