After upgrading our ES instance from 6.8 to 7.8, we started seeing errors like the following during indexing:
java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=178,endOffset=193,lastStartOffset=187 for field
When we examined the text, we found that the Kuromoji tokenizer in extended mode produces tokens whose start offsets go backwards. In the output below, the last token starts at offset 4 even though the preceding token starts at offset 13:
{
  "tokenizer": {
    "type": "kuromoji_tokenizer",
    "mode": "extended"
  },
  "text": "株式会社ワーナーミュージック・ジャパン"
}
{
  "tokens": [
    {
      "token": "株式",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    ...
    {
      "token": "ク",
      "start_offset": 13,
      "end_offset": 14,
      "type": "word",
      "position": 11
    },
    {
      "token": "ワーナーミュージック・ジャパン",
      "start_offset": 4,
      "end_offset": 19,
      "type": "word",
      "position": 12
    }
  ]
}
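To make the problem concrete, here is a minimal Python sketch of the three conditions named in the exception message, applied to the tokens shown above. It is not Lucene's actual code. The function name `find_backward_offsets` and the `sample` list are hypothetical and only for illustration, and the sample contains only the tokens visible in the output, not the elided ones.

```python
from typing import Dict, List


def find_backward_offsets(tokens: List[Dict]) -> List[Dict]:
    """Return tokens that break the offset rules quoted in the exception message."""
    violations = []
    last_start = 0
    for tok in tokens:
        start, end = tok["start_offset"], tok["end_offset"]
        # Three conditions from the message: start is non-negative,
        # end >= start, and start does not go behind the previous start.
        if start < 0 or end < start or start < last_start:
            violations.append(tok)
        # Assumption: this mirrors lastStartOffset in the message by
        # remembering the most recent token's start offset.
        last_start = start
    return violations


# Only the tokens shown in the output above; the elided ones are left out.
sample = [
    {"token": "株式", "start_offset": 0, "end_offset": 2, "position": 0},
    {"token": "ク", "start_offset": 13, "end_offset": 14, "position": 11},
    {"token": "ワーナーミュージック・ジャパン", "start_offset": 4, "end_offset": 19, "position": 12},
]

for tok in find_backward_offsets(sample):
    print(tok["token"], tok["start_offset"], tok["end_offset"])
```

Run against the sample, this flags only the final compound token, whose start offset of 4 is behind the preceding token's start offset of 13.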
This appears to be a bug. Are there any ways to work around this issue? Thank you!