ELSER2 | Spell check before creating embeddings

Hello Team,

Are there any suggestions for doing a spell check before creating embeddings? For example, if the query is misspelt as "toiket rolls" instead of "toilet rolls", can we create the embeddings for "toilet rolls" using the ELSER2 model?

POST _ml/trained_models/elser-model-2-for-ingest-search/_infer
{
  "docs":{
    "text_field": "toiket rolls"
  }
}

Result:

{
  "inference_results": [
    {
      "predicted_value": {
        "##ike": 2.758485,
        "##t": 2.2162936,
        "roll": 2.0772736,
        "to": 1.8866745,
        "rolls": 1.8158195,
        "rolling": 1.4678012,
        "##uge": 0.92356,
        "bring": 0.9175947,
        "sue": 0.7835926,
        "##k": 0.63172615,
        "technique": 0.61145663,
        "festival": 0.5835248,
        "##te": 0.5779746,
        "dutch": 0.5647326,
        "wheel": 0.5633026,
        "##nt": 0.5176124,
        "roller": 0.5174776,
        "japanese": 0.50211316,
        "flute": 0.49578997,
        "movement": 0.4836977,
        "german": 0.4779577,
        "rake": 0.46293172,
        "cake": 0.44880012,
        "horse": 0.42376143,
        "hand": 0.39447936,
        "dance": 0.3883844,
        "stunt": 0.3841404,
        "craft": 0.35372037,
        "stock": 0.31527817,
        "puppet": 0.29949415,
        "##ts": 0.28825995,
        "film": 0.27967602,
        "hang": 0.27863201,
        "beer": 0.25969,
        "paper": 0.25739628,
        "rice": 0.2504973,
        "rope": 0.20884833,
        "ski": 0.17991425,
        "dodge": 0.17231494,
        "ko": 0.16818042,
        "art": 0.15494661,
        "whip": 0.15116276,
        "foot": 0.14420456,
        "band": 0.14200562,
        "windmill": 0.13235468,
        "welcome": 0.12275867,
        "weaving": 0.10461076,
        "production": 0.07868658,
        "truck": 0.06703148,
        "vehicle": 0.05202646,
        "ride": 0.030120868,
        "build": 0.023026925,
        "french": 0.021837963,
        "fake": 0.019907437,
        "brake": 0.012350509,
        "wright": 0.0088391695,
        "piece": 0.006032948,
        "style": 0.0019135037
      }
    }
  ]
}

You will definitely get better results with spelling correction when the query is mistyped. It is always a tricky balance to strike, though, since sometimes the user may intend an exact match on a character sequence, such as a product code, which is not a real word. One might envision a very simple test for this case: check whether there are exact matches for each alphanumeric sequence in the query.
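
As a rough illustration of that exact-match test, here is a minimal Python sketch. The function name `tokens_needing_correction` and the `known_terms` set are hypothetical; in practice the set of known terms might come from an exact-match lookup against your own index, which is not shown here.

```python
import re

def tokens_needing_correction(query, known_terms):
    """Return the alphanumeric tokens of `query` that have no exact match.

    `known_terms` is a hypothetical set of lowercase terms known to exist
    in the indexed data (real words as well as product codes).
    """
    # Split the query into lowercase alphanumeric sequences.
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    # Tokens with an exact match are left alone; the rest are
    # candidates for spelling correction.
    return [token for token in tokens if token not in known_terms]

known_terms = {"toilet", "rolls", "ab1234"}
print(tokens_needing_correction("toiket rolls", known_terms))  # ['toiket']
print(tokens_needing_correction("AB1234 rolls", known_terms))  # []
```

Only the tokens returned would be passed to a spell corrector, so a product code that matches exactly is never "corrected" away.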

At the moment any spell correction logic would have to live upstream from the query to Elasticsearch. (It's beyond the scope of this forum to give detailed advice on how best to do spelling correction.) However, models do have some typo tolerance and we do train ELSER with some typos. We haven't yet tried to systematically improve its robustness to spelling errors via training, which is something we plan to explore. We've also recently been exploring training a seq2seq model to create "did you mean" suggestions. IMO this is the best way of tackling this problem. This work is still at an early stage, but it is something we are currently researching.
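
To make "upstream" concrete, here is one possible sketch of client-side correction done before the text is sent to the `_infer` endpoint. It uses Python's standard `difflib` for fuzzy matching, which is a stand-in chosen for illustration, not the approach described above and not a recommendation of a specific spell checker. The function name `correct_query`, the `vocabulary` list, and the `cutoff` value are all assumptions.

```python
import difflib
import re

def correct_query(query, vocabulary, cutoff=0.8):
    """Replace tokens missing from `vocabulary` with their closest known term.

    `vocabulary` is a hypothetical list of lowercase terms from your own data.
    Tokens with no sufficiently close match are kept unchanged.
    """
    corrected = []
    for token in re.findall(r"[a-z0-9]+", query.lower()):
        if token in vocabulary:
            corrected.append(token)
            continue
        # get_close_matches returns at most n candidates scoring >= cutoff.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)

vocabulary = ["toilet", "rolls", "paper"]
print(correct_query("toiket rolls", vocabulary))  # toilet rolls
```

The corrected string would then be used as the `text_field` value in the inference request, in place of the raw user input.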
