Kuromoji token offsets going backwards in extended mode

matobaa · October 29, 2020, 2:49am

After upgrading our Elasticsearch instance from 6.8 to 7.8, we started seeing errors like the following during indexing:

java.lang.IllegalArgumentException: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=178,endOffset=193,lastStartOffset=187 for field

When we examined the text, we found that the Kuromoji tokenizer in extended mode produces tokens whose start offsets go backwards. In the output below, the last token starts at offset 4 even though the token before it starts at offset 13:

{
	"tokenizer": {
		"type": "kuromoji_tokenizer",
		"mode": "extended"
	},
	"text": "株式会社ワーナーミュージック・ジャパン"
}

{
  "tokens": [
    {
      "token": "株式",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
   ...
    {
      "token": "ク",
      "start_offset": 13,
      "end_offset": 14,
      "type": "word",
      "position": 11
    },
    {
      "token": "ワーナーミュージック・ジャパン",
      "start_offset": 4,
      "end_offset": 19,
      "type": "word",
      "position": 12
    },
  
  ]
}
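
The condition described in the exception can be checked directly against a token list shaped like the response above. The following is a minimal sketch, not Lucene's own code: the function name `find_offset_violations` and the `sample_tokens` list are hypothetical, and the sample contains only the three tokens shown in the post (the elided ones are left out). It assumes each token is a dict with `token`, `start_offset`, and `end_offset` keys.

```python
from typing import Dict, List, Tuple


def find_offset_violations(tokens: List[Dict]) -> List[Tuple[int, str, int, int]]:
    """Return (index, token, start_offset, last_start_offset) for every token
    that breaks one of the rules named in the error message:
    start_offset >= 0, end_offset >= start_offset, and start_offset must not
    be smaller than the previous token's start_offset."""
    violations = []
    last_start = 0  # assumption: the first token is compared against offset 0
    for index, tok in enumerate(tokens):
        start = tok["start_offset"]
        end = tok["end_offset"]
        if start < 0 or end < start or start < last_start:
            violations.append((index, tok["token"], start, last_start))
        # Keep scanning after a violation so every offending token is reported.
        last_start = start
    return violations


# Hypothetical sample: only the tokens visible in the post, in the same order.
sample_tokens = [
    {"token": "株式", "start_offset": 0, "end_offset": 2},
    {"token": "ク", "start_offset": 13, "end_offset": 14},
    {"token": "ワーナーミュージック・ジャパン", "start_offset": 4, "end_offset": 19},
]

for index, token, start, last_start in find_offset_violations(sample_tokens):
    print(f"token #{index} {token!r}: start_offset={start}, lastStartOffset={last_start}")
```

Run against the sample, the only token reported is the long compound token, whose start offset of 4 follows a token starting at 13. A check like this could be used to find affected documents before indexing, but it only detects the problem and does not fix the tokenizer output.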

This appears to be a bug. Is there any way to work around it? Thank you!
