Is the following tokenization a bug or expected behavior in Kuromoji? It occurs with the default settings and can also be reproduced on the Kuromoji demo page.
I am using Kuromoji with the following tokenizer configuration:
"ja_tokenizer": {
  "type": "kuromoji_tokenizer",
  "mode": "search",
  "discard_punctuation": "false"
}
The actual tokenization of ゴロンと is:
{
  "tokens": [
    {
      "token": "ゴロ",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "ン",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 1
    },
    {
      "token": "と",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 2
    }
  ]
}
The expected tokenization is:
{
  "tokens": [
    {
      "token": "ゴロン",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "と",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 1
    }
  ]
}
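
For anyone who wants to reproduce this, here is a minimal Python sketch that builds an Elasticsearch `_analyze` request with the same tokenizer settings defined inline. It is only a sketch under some assumptions that are not part of the report above: a local node at `http://localhost:9200` with the analysis-kuromoji plugin installed, and the boolean `false` being treated the same as the string `"false"` in the configuration. The helper names `build_analyze_body` and `analyze` are hypothetical.

```python
import json
import urllib.request

# Assumption: a local Elasticsearch node with the analysis-kuromoji plugin installed.
ES_URL = "http://localhost:9200/_analyze"


def build_analyze_body(text):
    """Return an _analyze request body using the tokenizer settings from this report."""
    return {
        "tokenizer": {
            "type": "kuromoji_tokenizer",
            "mode": "search",
            # Assumption: boolean False is equivalent to the string "false" used above.
            "discard_punctuation": False,
        },
        "text": text,
    }


def analyze(text, url=ES_URL):
    """POST the body to the _analyze endpoint and return the token strings."""
    data = json.dumps(build_analyze_body(text)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return [token["token"] for token in result["tokens"]]


if __name__ == "__main__":
    # Print the request body only; call analyze("ゴロンと") against a running node
    # to see the tokens it actually returns.
    print(json.dumps(build_analyze_body("ゴロンと"), ensure_ascii=False, indent=2))
```

Calling `analyze("ゴロンと")` against a running node should return the list of token strings, which can then be compared with the actual and expected output shown above.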