Token offsets issue

Hello,
I'm upgrading from Elasticsearch 6.8.1 to 7.8.1 (tested on 8.2.3 as well) and I'm getting the following error:

startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards

It happens when the text contains both a synonym and a word with a delimiter character in it.
As part of the upgrade, the change I had to make in my analyzer's filter chain was moving word_delimiter from before the synonym filter to after it.
So now I have:

"analyzer": {
        "stemmed_en": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "synonym_en",
            "word_delimiter",
            "stemmer_en"
          ]
        }
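For context, the relevant part of the index settings looks roughly like this. The synonym_en rule shown matches the "email, e mail" synonym mentioned further down; the stemmer_en definition is my assumption of a standard English stemmer, since its exact configuration isn't shown here:

PUT index_name
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_en": {
          "type": "synonym",
          "synonyms": ["email, e mail"]
        },
        "stemmer_en": {
          "type": "stemmer",
          "language": "english"
        }
      },
      "analyzer": {
        "stemmed_en": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["lowercase", "synonym_en", "word_delimiter", "stemmer_en"]
        }
      }
    }
  }
}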

In addition, I have a mapping for a content field:

"content": {
        "type": "text",
        "analyzer": "stemmed_en",
        "norms": false
      }

For example, "email" has the synonym rule [email, e mail], and I try to index the following document:

POST index_name/_doc/12345
{
    "content": "email abc@def.com"
}

I'm getting the following error:

"type": "illegal_argument_exception",
"reason": "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=6,endOffset=17,lastStartOffset=14 for field 'content'"

To see what is going on, I ran the text through the analyze API.
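The request was roughly the following (field-based, so the same stemmed_en analyzer from the content mapping is applied; the exact body is my reconstruction):

GET index_name/_analyze
{
  "field": "content",
  "text": "email abc@def.com"
}

It returns these tokens: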

{
    "tokens": [
        {
            "token": "email",
            "start_offset": 0,
            "end_offset": 5,
            "type": "word",
            "position": 0
        },
        {
            "token": "e",
            "start_offset": 0,
            "end_offset": 5,
            "type": "SYNONYM",
            "position": 0
        },
        {
            "token": "abc",
            "start_offset": 6,
            "end_offset": 9,
            "type": "word",
            "position": 1
        },
        {
            "token": "def",
            "start_offset": 10,
            "end_offset": 13,
            "type": "word",
            "position": 2
        },
        {
            "token": "com",
            "start_offset": 14,
            "end_offset": 17,
            "type": "word",
            "position": 3
        },
        {
            "token": "mail",
            "start_offset": 6,
            "end_offset": 17,
            "type": "SYNONYM",
            "position": 3
        }
    ]
}

So it's quite clear that the synonym offsets are getting mixed up with the word delimiting: the expanded synonym token "mail" at position 3 keeps start_offset 6, which is behind the start_offset 14 of the preceding "com" token, so the offsets go backwards.
Can you help me understand what the cause is and how to fix it?

  • Note: I have also tried using word_delimiter_graph instead of word_delimiter, but got the same error (the filter list I tried is shown below).
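For reference, that variant only swaps the filter name in the chain, everything else unchanged:

"filter": [
  "lowercase",
  "synonym_en",
  "word_delimiter_graph",
  "stemmer_en"
]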

Thanks

Looks like a bug in the synonym filter:
