Kuromoji: Tokenization of ゴロンと is Unexpected (incorrect?)

HajimeOwari · February 13, 2018, 7:02am

Is the behavior below a bug or expected behavior for Kuromoji? It occurs with the default settings and also on the Kuromoji demo page.

Using Kuromoji as configured below.

    "ja_tokenizer": {
        "type": "kuromoji_tokenizer",
        "mode": "search",
        "discard_punctuation": "false"
    }
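
For anyone who wants to reproduce this without creating an index, a request body along these lines could be sent to the `_analyze` endpoint. This is only a sketch: it assumes the analysis-kuromoji plugin is installed and defines the tokenizer inline with the same options as above.

```json
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "search",
    "discard_punctuation": "false"
  },
  "text": "ゴロンと"
}
```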

Tokenization for ゴロンと is:

{
"tokens": [
    {
        "token": "ゴロ",
        "start_offset": 0,
        "end_offset": 2,
        "type": "word",
        "position": 0
    },
    {
        "token": "ン",
        "start_offset": 2,
        "end_offset": 3,
        "type": "word",
        "position": 1
    },
    {
        "token": "と",
        "start_offset": 3,
        "end_offset": 4,
        "type": "word",
        "position": 2
    }
]
}

Expected is:

{
"tokens": [
    {
        "token": "ゴロン",
        "start_offset": 0,
        "end_offset": 3,
        "type": "word",
        "position": 0
    },
    {
        "token": "と",
        "start_offset": 3,
        "end_offset": 4,
        "type": "word",
        "position": 1
    }
]
}

johtani · February 13, 2018, 7:43am

That is expected behavior. I think it depends on the dictionary.

For a search use case, you can use the nbest feature of kuromoji_tokenizer, or a user_dictionary.
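
As a rough sketch of how those two options might be combined, the index settings below extend the tokenizer from the question. Several things here are assumptions and not from the thread: the file name `userdict_ja.txt` is hypothetical (the path is resolved relative to the Elasticsearch config directory), the analyzer name `ja_analyzer` is made up for illustration, and the `nbest_cost` value of 2000 is only a placeholder that would need tuning. The user dictionary file is assumed to contain one entry per line in the form text, segmentation, readings, part-of-speech tag, for example `ゴロン,ゴロン,ゴロン,カスタム名詞`, where the part-of-speech tag is itself a placeholder.

```json
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "ja_tokenizer": {
          "type": "kuromoji_tokenizer",
          "mode": "search",
          "discard_punctuation": "false",
          "user_dictionary": "userdict_ja.txt",
          "nbest_cost": 2000
        }
      },
      "analyzer": {
        "ja_analyzer": {
          "type": "custom",
          "tokenizer": "ja_tokenizer"
        }
      }
    }
  }
}
```

With `nbest_cost` set, the tokenizer also emits tokens from alternative segmentations whose cost is within that margin of the best path, so ゴロ and ン would likely still appear alongside any additional tokens. Whether ゴロン shows up through nbest alone depends on the dictionary and its costs; the user dictionary entry is the more direct way to force it.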

HajimeOwari · February 20, 2018, 1:47pm

Many thanks for the quick reply. We're using Neologd. I'm not seeing that specific entry in the dictionary. Will consider nbest and other options then. Thank you!
