Kuromoji token offsets going backwards in extended mode

After upgrading our Elasticsearch instance from 6.8 to 7.8, we started seeing errors like the following during indexing:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=178,endOffset=193,lastStartOffset=187 for field

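When we ran the offending text through the _analyze API, we found that the Kuromoji tokenizer in extended mode emits tokens whose start offsets go backwards. In the response below, the last token starts at offset 4 even though the token before it starts at offset 13: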

{
	"tokenizer": {
		"type": "kuromoji_tokenizer",
		"mode": "extended"
	},
	"text": "株式会社ワーナーミュージック・ジャパン"
}
{
  "tokens": [
    {
      "token": "株式",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    ...
    {
      "token": "ク",
      "start_offset": 13,
      "end_offset": 14,
      "type": "word",
      "position": 11
    },
    {
      "token": "ワーナーミュージック・ジャパン",
      "start_offset": 4,
      "end_offset": 19,
      "type": "word",
      "position": 12
    },
  
  ]
}
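
For anyone trying to find the affected documents, here is a minimal Python sketch that scans the "tokens" array of an _analyze response for start offsets that go backwards. The function name find_backward_offsets is my own, and the comparison against the previous token's start offset is an assumption based on the wording of the exception message, not something taken from the Lucene source. The sample list contains only the three tokens shown above; the elided ones are left out.

```python
from typing import Dict, List


def find_backward_offsets(tokens: List[Dict]) -> List[Dict]:
    """Return tokens whose start_offset is smaller than the previous token's.

    Assumption: this mirrors the "offsets must not go backwards" condition
    in the exception message (startOffset < lastStartOffset).
    """
    offenders = []
    last_start = 0
    for token in tokens:
        if token["start_offset"] < last_start:
            offenders.append(token)
        last_start = token["start_offset"]
    return offenders


# Only the tokens visible in the response above; elided tokens are omitted.
tokens = [
    {"token": "株式", "start_offset": 0, "end_offset": 2, "position": 0},
    {"token": "ク", "start_offset": 13, "end_offset": 14, "position": 11},
    {
        "token": "ワーナーミュージック・ジャパン",
        "start_offset": 4,
        "end_offset": 19,
        "position": 12,
    },
]

for bad in find_backward_offsets(tokens):
    print(bad["token"], bad["start_offset"], bad["end_offset"])
```

With the tokens above, only the compound token at position 12 is reported. Its offsets are consistent with the exception: the token is 15 characters long (193 - 178) and starts 9 characters before the previous token (187 - 178, the same gap as 13 - 4).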

This looks like a bug. Is there a way to work around it? Thank you!
