Elasticsearch Term suggester is not returning correct suggestions when one character is missing (instead of misspelling)

I'm using the Elasticsearch term suggester for spelling correction. My index contains a large number of ads, each with subject and body fields. I've found an example for which the suggester does not return the correct suggestion.

Many ads have a subject containing the word "soffa", and only 5 ads have a subject containing "sofa". Ideally, when I send "sofa" (the wrong spelling) as the text to the suggester, it should return "soffa" (the correct spelling) as a suggestion, since most ads contain "soffa" and only a few contain "sofa".

Here is my suggester query body:

{
  "suggest": {
    "text": "sofa",
    "subjectSuggester": {
      "term": {
        "field": "subject",
        "suggest_mode": "popular",
        "min_word_length": 1
      }
    }
  }
}

When I send the query above, I get the following response:

{
    "took": 6,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    },
    "suggest": {
        "subjectSuggester": [
            {
                "text": "sof",
                "offset": 0,
                "length": 4,
                "options": [
                    {
                        "text": "soff",
                        "score": 0.6666666,
                        "freq": 298
                    },
                    {
                        "text": "sol",
                        "score": 0.6666666,
                        "freq": 101
                    },
                    {
                        "text": "saf",
                        "score": 0.6666666,
                        "freq": 6
                    }
                ]
            }
        ]
    }
}

As you can see in the response above, it returned "soff" but not "soffa", although I have lots of documents whose subject contains "soffa".
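
One detail in the response above may be worth checking: the request text was "sofa", yet the entry comes back with "text": "sof", and the options are similarly shortened forms such as "soff". That pattern would be consistent with the analyzer on subject stemming or otherwise rewriting tokens, although the thread never shows that field's analyzer, so this is only a guess. A minimal sketch of building a request for the _analyze API to see which tokens the field really produces; the index name ads is hypothetical, and the snippet only builds the request without sending anything:

```python
import json


def build_analyze_request(index, field, text):
    """Return (method, path, body) for an _analyze call that runs
    `text` through the analyzer configured for `field`."""
    path = f"/{index}/_analyze"
    body = {"field": field, "text": text}
    return "GET", path, body


# "ads" is a hypothetical index name; replace it with the real one.
method, path, body = build_analyze_request("ads", "subject", "sofa")
print(method, path)
print(json.dumps(body, indent=2))
```

If the tokens returned for "soffa" and "sofa" are not the words themselves, the suggester can only propose whatever terms were actually indexed.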

I also experimented with parameters such as suggest_mode and string_distance, but with no luck.

I also tried the phrase suggester instead of the term suggester, with the same result. Here is my phrase suggester query:

{
    "suggest": {
        "text": "sofa",
        "subjectSuggester": {
            "phrase": {
                "field": "subject",
                "size": 10,
                "gram_size": 3,
                "direct_generator": [
                    {
                        "field": "subject.trigram",
                        "suggest_mode": "always",
                        "min_word_length":1
                    }
                ]
            }
        }
    }
}

I suspect it doesn't work when a character is missing rather than misspelled. In the "soffa" example, one "f" is missing. It works fine for other misspellings: when I send "vovlo", it gives me "volvo".
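
To test the missing-character theory, it may help to compare the edit distances involved. The term suggester's default string_distance is documented as being based on Damerau-Levenshtein, and max_edits defaults to 2. The sketch below is a plain-Python optimal string alignment distance, used here as an approximation and not as Lucene's actual implementation:

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insertion, deletion,
    substitution and adjacent transposition each cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "vl" <-> "lv"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for typed, wanted in [("sofa", "soffa"), ("vovlo", "volvo")]:
    print(typed, "->", wanted, osa_distance(typed, wanted))
```

Both pairs come out at distance 1, so under this measure a missing character is no further away than a swapped pair, and both are within the default max_edits. That suggests the cause may lie elsewhere, for example in how the field is analyzed.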

Any help would be hugely appreciated.

Hi!

Try changing the "string_distance".

{
  "suggest": {
    "text": "sof",
    "subjectSuggester": {
      "term": {
        "field": "title",
        "min_word_length":2,
        "string_distance":"ngram"
      }
    }
  }
}

I already tried string_distance: ngram, but it didn't work.

I'm running a test with some data ("sol", "saf", "soffa"), and I managed to get the correct suggestion.
If you have a list of terms that I can test with, I could work out what is happening.

I found a workaround myself.
I added a shingle filter with max_shingle_size 3 and a custom analyzer (named trigram) that uses it, then added a subfield with that analyzer and ran the suggester query against that subfield instead of the actual field, and it worked.

Here are the mapping changes:

{
    "settings": {
        "analysis": {
            "filter": {
                "shingle": {
                    "type": "shingle",
                    "min_shingle_size": 2,
                    "max_shingle_size": 3
                }
            },
            "analyzer": {
                "trigram": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [
                        "lowercase",
                        "shingle"
                    ],
                    "char_filter": [
                        "diacritical_marks_filter"
                    ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "subject": {
                "type": "text",
                "fields": {
                    "trigram": {
                        "type": "text",
                        "analyzer": "trigram"
                    }
                }
            }
        }
    }
}
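
For readers unfamiliar with the shingle filter, it builds word-level n-grams, not character n-grams. A rough Python sketch of what a filter with min_shingle_size 2 and max_shingle_size 3 emits, assuming the filter's default of also outputting unigrams; the sample subject is made up, `lower().split()` is only a stand-in for the standard tokenizer plus lowercase, and the real filter's token order and positions are not modeled:

```python
def shingles(tokens, min_size=2, max_size=3, output_unigrams=True, sep=" "):
    """Approximate the terms a shingle token filter would emit."""
    out = list(tokens) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(sep.join(tokens[start:start + size]))
    return out


# Hypothetical ad subject, not taken from the real index.
tokens = "Stor soffa säljes".lower().split()
terms = shingles(tokens)
print(terms)
```

Because single words such as "soffa" are still emitted, the term suggester can propose them from subject.trigram. The analyzer chain shown above also contains no stemmer, which may be the more relevant difference from the original subject field.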

And here is my corrected query:

{
  "suggest": {
    "text": "sofa",
    "subjectSuggester": {
      "term": {
        "field": "subject.trigram",
        "suggest_mode": "popular",
        "min_word_length": 1,
        "string_distance": "ngram"
      }
    }
  }
}

Note that I'm running the suggester against subject.trigram instead of subject itself.

Here is the result:

{
    "suggest": {
        "subjectSuggester": [
            {
                "text": "sofa",
                "offset": 0,
                "length": 4,
                "options": [
                    {
                        "text": "soffa",
                        "score": 0.8,
                        "freq": 282
                    },
                    {
                        "text": "soffan",
                        "score": 0.6666666,
                        "freq": 5
                    },
                    {
                        "text": "som",
                        "score": 0.625,
                        "freq": 102
                    },
                    {
                        "text": "sol",
                        "score": 0.625,
                        "freq": 82
                    },
                    {
                        "text": "sony",
                        "score": 0.625,
                        "freq": 50
                    }
                ]
            }
        ]
    }
}

As you can see above, soffa appears as the first suggestion.
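
For anyone consuming this response in application code, a small sketch of picking the best option per input token. The response dict below is an abbreviated copy of the result above, and the helper name is hypothetical:

```python
def best_suggestions(response, suggester_name):
    """Map each input token to its highest-scoring option, if any."""
    best = {}
    for entry in response.get("suggest", {}).get(suggester_name, []):
        options = entry.get("options", [])
        if options:
            top = max(options, key=lambda opt: opt["score"])
            best[entry["text"]] = top["text"]
    return best


# Abbreviated copy of the response shown above.
response = {
    "suggest": {
        "subjectSuggester": [
            {
                "text": "sofa",
                "offset": 0,
                "length": 4,
                "options": [
                    {"text": "soffa", "score": 0.8, "freq": 282},
                    {"text": "soffan", "score": 0.6666666, "freq": 5},
                    {"text": "som", "score": 0.625, "freq": 102},
                ],
            }
        ]
    }
}

corrections = best_suggestions(response, "subjectSuggester")
print(corrections)
```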
