Kuromoji: Tokenization of ゴロンと is Unexpected (incorrect?)

Is the behavior below a bug or expected behavior for Kuromoji? It occurs with the default settings and also on the Kuromoji demo page.

I'm using Kuromoji with the tokenizer configured as follows:

    "ja_tokenizer": {
        "type": "kuromoji_tokenizer",
        "mode": "search",
        "discard_punctuation": "false"
    }
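
For anyone reproducing this, a request body along the following lines can be sent to the _analyze API (for example POST /_analyze). This is only a sketch: it inlines the tokenizer settings shown above instead of referencing an index, and it assumes the analysis-kuromoji plugin is installed.

```json
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search",
    "discard_punctuation": "false"
  },
  "text": "ゴロンと"
}
```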

The actual tokenization of ゴロンと is:

{
"tokens": [
    {
        "token": "ゴロ",
        "start_offset": 0,
        "end_offset": 2,
        "type": "word",
        "position": 0
    },
    {
        "token": "ン",
        "start_offset": 2,
        "end_offset": 3,
        "type": "word",
        "position": 1
    },
    {
        "token": "と",
        "start_offset": 3,
        "end_offset": 4,
        "type": "word",
        "position": 2
    }
]
}

The expected tokenization is:

{
"tokens": [
    {
        "token": "ゴロン",
        "start_offset": 0,
        "end_offset": 3,
        "type": "word",
        "position": 0
    },
    {
        "token": "と",
        "start_offset": 3,
        "end_offset": 4,
        "type": "word",
        "position": 1
    }
]
}

That is expected behavior. I think it depends on the dictionary.

For a search use case, you can use the nbest feature of kuromoji_tokenizer or a user_dictionary.
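
As a rough sketch of those two options, the index settings below define one tokenizer that uses nbest_examples and another that uses inline user_dictionary_rules. The tokenizer names, the example string /ゴロンと-ゴロン/, and the dictionary entry (including its reading and the part-of-speech label カスタム副詞) are illustrative assumptions, not values from this thread. Inline user_dictionary_rules are only available in Elasticsearch versions that support them; otherwise the same entry would go into a user_dictionary file.

```json
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "ja_tokenizer_nbest": {
            "type": "kuromoji_tokenizer",
            "mode": "search",
            "discard_punctuation": "false",
            "nbest_examples": "/ゴロンと-ゴロン/"
          },
          "ja_tokenizer_userdict": {
            "type": "kuromoji_tokenizer",
            "mode": "search",
            "discard_punctuation": "false",
            "user_dictionary_rules": [
              "ゴロン,ゴロン,ゴロン,カスタム副詞"
            ]
          }
        }
      }
    }
  }
}
```

With n-best, ゴロン can only be emitted as an additional token if that segmentation exists as a candidate for the tokenizer, so it is worth checking the result of either variant with the _analyze API before relying on it.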

Many thanks for the quick reply. We're using Neologd, and I'm not seeing that specific entry in the dictionary. We'll consider nbest and the other options then. Thank you!
