Token offsets issue

Hello,
I'm upgrading from Elasticsearch 6.8.1 to 7.8.1 (tested on 8.2.3 as well) and I'm getting the following error:

startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards

It happens when the text contains both a synonym and a word with a delimiter character in it.
As part of the upgrade, the change I had to make in my analyzer's filter chain was moving word_delimiter from before the synonym filter to after it.
So now I have:

"analyzer": {
        "stemmed_en": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "synonym_en",
            "word_delimiter",
            "stemmer_en"
          ]
        }
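For context, the relevant part of the index settings looks roughly like this. The synonym_en rule shown matches the "email, e mail" synonym mentioned further down; the stemmer_en definition is my assumption of a standard English stemmer, since its exact configuration isn't shown here:

PUT index_name
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_en": {
          "type": "synonym",
          "synonyms": ["email, e mail"]
        },
        "stemmer_en": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "stemmed_en": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "synonym_en", "word_delimiter", "stemmer_en"]
        }
      }
    }
  }
}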

In addition, I have a mapping for a content field:

"content": {
        "type": "text",
        "analyzer": "stemmed_en",
        "norms": false
      }

For example, "email" has the synonym rule [email, e mail], and I try to index the following document:

POST index_name/_doc/12345
{
    "content": "email abc@def.com"
}

I'm getting the following error:

"type": "illegal_argument_exception",
"reason": "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=6,endOffset=17,lastStartOffset=14 for field 'content'"

To see what is going on, I ran the text through the analyze API.
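The request was roughly the following (field-based, so the same stemmed_en analyzer from the content mapping is applied; the exact body is my reconstruction):

GET index_name/_analyze
{
  "field": "content",
  "text": "email abc@def.com"
}

It returns these tokens: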

{
    "tokens": [
        {
            "token": "email",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 0
        },
        {
            "token": "e",
            "start_offset": 0,
            "end_offset": 5,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "abc",
            "start_offset": 6,
            "end_offset": 9,
            "type": "word",
            "position": 1
        },
        {
            "token": "def",
            "start_offset": 10,
            "end_offset": 13,
            "type": "word",
            "position": 2
        },
        {
            "token": "com",
            "start_offset": 14,
            "end_offset": 17,
            "type": "word",
            "position": 3
        },
        {
            "token": "mail",
            "start_offset": 6,
            "end_offset": 17,
            "type": "SYNONYM",
            "position": 3
        }
    ]
}

So it's quite clear that the synonym offsets are getting mixed up with the word delimiting: the expanded synonym token "mail" at position 3 keeps start_offset 6, which is behind the start_offset 14 of the preceding "com" token, so the offsets go backwards.
Can you help me understand what the cause is and how to fix it?

  • Note: I have also tried using word_delimiter_graph instead of word_delimiter, but got the same error (the filter list I tried is shown below).
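For reference, that variant only swaps the filter name in the chain, everything else unchanged:

"filter": [
  "lowercase",
  "synonym_en",
  "word_delimiter_graph",
  "stemmer_en"
]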

Thanks

Looks like a bug in the synonym filter:
