Question: Symbol filtering and machine learning error: startOffset must be non-negative, and endOffset must be >= startOffset; got startOffset=14,endOffset=13

Chenko · May 3, 2024, 2:06pm

Hi!

I have the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "startOffset must be non-negative, and endOffset must be >= startOffset; got startOffset=14,endOffset=13"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "startOffset must be non-negative, and endOffset must be >= startOffset; got startOffset=14,endOffset=13"
  },
  "status": 400
}

when running the following call:

POST /_ml/trained_models/.multilingual-e5-small_linux-x86_64/_infer
{
  "docs": {
    "text_field": "bla bla bla ⅓ bla bla bla"
  }
}

I figured out that the error is caused by the '⅓' symbol.
Is there any way to filter out such symbols in my pipeline, so that only the characters the model understands go through?
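
To make the kind of filtering concrete, here is a minimal client-side sketch in Python. It is only an illustration, not a confirmed fix: it assumes the problematic characters are those that expand into several code points under Unicode NFKC normalization (as '⅓' does, becoming "1⁄3"), and the helper name `drop_expanding_chars` is made up for this example.

```python
import unicodedata


def drop_expanding_chars(text: str) -> str:
    """Remove characters whose NFKC form is not exactly one code point.

    Assumption (not confirmed for this model): characters such as '⅓',
    which normalize to several code points, are the ones that trigger
    the offset error.
    """
    return "".join(
        ch for ch in text
        if len(unicodedata.normalize("NFKC", ch)) == 1
    )


# The sample text from the failing _infer call above.
cleaned = drop_expanding_chars("bla bla bla ⅓ bla bla bla")
print(cleaned)
```

This drops the symbol entirely and leaves the surrounding spaces in place. Replacing the character with its normalized form instead of dropping it would be an alternative, but whether the model then accepts the text is untested here.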

I am using Elastic Cloud 8.13.3 for this test.

I originally came across this error while using the _update_by_query API to update my index. Some feedback from me: please make it easier to see that this error comes from the ML job/pipeline. It took me a while to figure out that the symbol was the cause.

Kind regards,
Chenko