Handling misspelled words / typos in Elasticsearch without fuzziness

Hi All,
We are working on an ecommerce product built with Next.js and a Python API. We make Elasticsearch REST API calls from the React frontend.

We are facing the following problem:
Spelling mistakes need to be handled without fuzziness.
For example: flour tiels should only match floor tiles. It should not match the items below:

  • floor mats
  • wooden floor
  • floor carpet
  • vitrified tiles
  • wall tiles
  • Kitchen tiles

So we are unable to use fuzziness here. We have also tried the nGram option, but it was not helpful.

Any other suggestion to handle this would be a great help for us. Thank you all!

This is typically what fuzziness is used for. Why are you not able to use it here? What is the mapping of the field(s) you are querying?

Hi @Christian_Dahlqvist, I have just updated the question; there were some wording mistakes.

For example: flour tiels should only match floor tiles. It should not match the items below:

  • floor mats
  • wooden floor
  • floor carpet
  • vitrified tiles
  • wall tiles
  • Kitchen tiles

Fuzziness is matching the above items in our case!

That is probably because of the mapping of the field(s) queried and/or the query. Can you please share these?

I think you might be confusing fuzzy matching (also called "typo tolerance") with phrase matching, which allows you to get lower-scored documents when fewer terms (but still some terms!) match.

In the examples you gave, I'd expect Floor tiles to be the top result, but the others would still show up in your result set with lower scores. This is because each of them has one term that fuzzy-matches one of the query terms. But since Floor tiles has two terms that fuzzy-match the two query terms, it should be the top result.

Side note - if you're finding yourself confused or overwhelmed by all the moving pieces of building a relevant search solution on top of bare Elasticsearch, I suggest you look at Elastic App Search, which abstracts away a lot of the low-level details and can help you set up an end-user search experience that mimics a lot of the relevance patterns we have come to expect from tools like Google.

Hi @Christian_Dahlqvist,
Here is the query

Request Query

{
    "_source": [
        "title",
        "keywords",
        "description",
        "other_information"
    ],
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "flour tiels",
                        "fields": [
                            "title^4",
                            "keywords^3",
                            "description^2",
                            "other_information^1"
                        ],
                        "fuzziness": "AUTO"
                    }
                }
            ]
        }
    }
}

Document Mapping

mappings = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "synonym_stemmer_bad_words_analyzer",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        },
        "keywords": {
            "type": "text",
            "analyzer": "synonym_stemmer_bad_words_analyzer",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        },
        "description": {
            "type": "text",
            "analyzer": "synonym_stemmer_bad_words_analyzer",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        },
        "other_information": {
            "type": "text",
            "analyzer": "synonym_stemmer_bad_words_analyzer",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        }
    }
}

As the query is structured at the moment I believe it will match any document that contains either term in the query, which is why you get a lot of matches that you do not want. If you require both terms to match (fuzzy or not fuzzy) you need to change the query.

Are you expecting the two search terms to be in the same field, or can they appear in different fields? If it is the former, you can use an and operator in your multi_match query clause. If it is the latter, you could instead look at the cross_fields query type.
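For what it's worth, the two variants could be sketched as query bodies built in Python (a sketch only; the field list and boosts are copied from the query above, and as far as I know the cross_fields type does not support the fuzziness parameter, so it is omitted there):

```python
# Sketch of the two variants, built as plain dicts.
# Field names and boosts are taken from the query earlier in the thread.

FIELDS = ["title^4", "keywords^3", "description^2", "other_information^1"]

def both_terms_same_field(query_text):
    # "operator": "and" requires every term to match within a single field.
    return {
        "query": {
            "bool": {
                "must": [{
                    "multi_match": {
                        "query": query_text,
                        "fields": FIELDS,
                        "operator": "and",
                        "fuzziness": "AUTO",
                    }
                }]
            }
        }
    }

def terms_across_fields(query_text):
    # "cross_fields" treats the fields as one combined field, so the two
    # terms may each match in a different field. No fuzziness here.
    return {
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": FIELDS,
                "type": "cross_fields",
                "operator": "and",
            }
        }
    }
```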

OK, let me try that @Christian_Dahlqvist, thanks. I'll update once tested!

Hi @Christian_Dahlqvist,
We used "type": "best_fields", "operator": "and"; this works for about 80% of our expectations. But when we search for the misspelled tabel, we don't get any results because fuzziness is disabled.

When fuzziness is enabled, we get taps and other products like dinning table and table top, which should not match in our case. Taps is matched because ta matches the searched keyword.

Do we need to make 2 calls, first without fuzziness and then with fuzziness if no results are found? Or can we make a single call that keeps fuzziness optional and applies it automatically only when no results are found? Or is there any other solution?

{
    "from": 0,
    "size": 10000,
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "table",
                        "fields": [
                            "title^9",
                            "keywords^8",
                            "description^7",
                            "attributes^6",
                            "other_information^5",
                            "tagged_services.name^4",
                            "category_id.name^3",
                            "sub_category_id.name^2",
                            "sub_sub_category_id.name^1"
                        ],
                        "type": "best_fields",
                        "operator": "and"
                        // "fuzziness": "AUTO"
                    }
                }
            ]
        }
    }
}

Why would dinning table and table top not match if you search for just tabel?

The reason taps would match when you search for tabel is probably your custom analyzer, which includes stemming. AUTO fuzziness for a 5-character search string only allows an edit distance of 1. If the stemmer generates even shorter tokens this may drop to 0 and require an exact match. Without the stemmer in place I suspect you would not see this.

You can use the analyze API to compare how 'taps' and 'tabel' are tokenised. If you look at the length of the tokens generated and compare this to the AUTO fuzziness length table I think you will see why you are getting the match.
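To illustrate the arithmetic: AUTO fuzziness allows 0 edits for terms of 0-2 characters, 1 edit for 3-5 characters, and 2 edits for 6 or more. A quick plain-Levenshtein sketch (note that Elasticsearch actually uses Damerau-Levenshtein by default, where a transposition counts as one edit, so this slightly overstates distances involving swapped letters):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def auto_fuzziness(term):
    # Edit distance AUTO allows for a term of this length.
    n = len(term)
    return 0 if n <= 2 else 1 if n <= 5 else 2

# 'tabel' keeps 5 characters after stemming, so AUTO allows 1 edit,
# while 'tabel' -> 'tap' needs 3 edits: no fuzzy match expected here.
```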

You can not have that kind of conditional logic within a query but you can send the 2 requests in parallel in a single call using the multi search API.
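Such a fallback could be sketched roughly like this (hypothetical index name; the _msearch endpoint takes newline-delimited JSON with alternating header and body lines):

```python
import json

def build_msearch_body(query_text, index="products"):
    # Two searches in one _msearch request: exact first, fuzzy second.
    # The client can prefer the exact hits and fall back to the fuzzy ones.
    fields = ["title^4", "keywords^3", "description^2", "other_information^1"]

    def search(fuzziness=None):
        mm = {"query": query_text, "fields": fields, "operator": "and"}
        if fuzziness:
            mm["fuzziness"] = fuzziness
        return {"query": {"multi_match": mm}}

    lines = []
    for body in (search(), search("AUTO")):
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(body))              # body line
    return "\n".join(lines) + "\n"  # _msearch requires a trailing newline
```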

Yes, you are correct, that could be considered! But taps is not expected!

What does the analyze API give? If taps stems to tap and table stems to tab (just examples), you will be searching with a 3-letter search string that allows an edit distance of 1. The difference between the two stemmed strings is 1 edit, so they match.

The analyze results are:

  • table stems to tabl
  • tabel stems to tabel
  • tap stems to tap
  • taps stems to tap

Query: Table

{
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "synonym_filter",
        "bad_words_filter",
        "stemmer_filter"
    ],
    "text": [
        "table"
    ],
    "explain" : true
}

Response:

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "standard",
            "tokens": [
                {
                    "token": "table",
                    "start_offset": 0,
                    "end_offset": 5,
                    "type": "<ALPHANUM>",
                    "position": 0,
                    "bytes": "[74 61 62 6c 65]",
                    "positionLength": 1,
                    "termFrequency": 1
                }
            ]
        },
        "tokenfilters": [
            {
                "name": "lowercase",
                "tokens": [
                    {
                        "token": "table",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 6c 65]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "synonym_filter",
                "tokens": [
                    {
                        "token": "table",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 6c 65]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "bad_words_filter",
                "tokens": [
                    {
                        "token": "table",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 6c 65]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "stemmer_filter",
                "tokens": [
                    {
                        "token": "tabl",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 6c]",
                        "keyword": false,
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            }
        ]
    }
}

Query: Tabel

{
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "synonym_filter",
        "bad_words_filter",
        "stemmer_filter"
    ],
    "text": [
        "tabel"
    ],
    "explain" : true
}

Response:

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "standard",
            "tokens": [
                {
                    "token": "tabel",
                    "start_offset": 0,
                    "end_offset": 5,
                    "type": "<ALPHANUM>",
                    "position": 0,
                    "bytes": "[74 61 62 65 6c]",
                    "positionLength": 1,
                    "termFrequency": 1
                }
            ]
        },
        "tokenfilters": [
            {
                "name": "lowercase",
                "tokens": [
                    {
                        "token": "tabel",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 65 6c]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "synonym_filter",
                "tokens": [
                    {
                        "token": "tabel",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 65 6c]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "bad_words_filter",
                "tokens": [
                    {
                        "token": "tabel",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 65 6c]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "stemmer_filter",
                "tokens": [
                    {
                        "token": "tabel",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 65 6c]",
                        "keyword": false,
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            }
        ]
    }
}

Query: Tap

{
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "synonym_filter",
        "bad_words_filter",
        "stemmer_filter"
    ],
    "text": [
        "tap"
    ],
    "explain" : true
}

Response:

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "standard",
            "tokens": [
                {
                    "token": "tap",
                    "start_offset": 0,
                    "end_offset": 3,
                    "type": "<ALPHANUM>",
                    "position": 0,
                    "bytes": "[74 61 70]",
                    "positionLength": 1,
                    "termFrequency": 1
                }
            ]
        },
        "tokenfilters": [
            {
                "name": "lowercase",
                "tokens": [
                    {
                        "token": "tap",
                        "start_offset": 0,
                        "end_offset": 3,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "synonym_filter",
                "tokens": [
                    {
                        "token": "tap",
                        "start_offset": 0,
                        "end_offset": 3,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "bad_words_filter",
                "tokens": [
                    {
                        "token": "tap",
                        "start_offset": 0,
                        "end_offset": 3,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "stemmer_filter",
                "tokens": [
                    {
                        "token": "tap",
                        "start_offset": 0,
                        "end_offset": 3,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70]",
                        "keyword": false,
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            }
        ]
    }
}

Query: Taps

{
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "synonym_filter",
        "bad_words_filter",
        "stemmer_filter"
    ],
    "text": [
        "taps"
    ],
    "explain" : true
}

Response:

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "standard",
            "tokens": [
                {
                    "token": "taps",
                    "start_offset": 0,
                    "end_offset": 4,
                    "type": "<ALPHANUM>",
                    "position": 0,
                    "bytes": "[74 61 70 73]",
                    "positionLength": 1,
                    "termFrequency": 1
                }
            ]
        },
        "tokenfilters": [
            {
                "name": "lowercase",
                "tokens": [
                    {
                        "token": "taps",
                        "start_offset": 0,
                        "end_offset": 4,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70 73]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "synonym_filter",
                "tokens": [
                    {
                        "token": "taps",
                        "start_offset": 0,
                        "end_offset": 4,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70 73]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "bad_words_filter",
                "tokens": [
                    {
                        "token": "taps",
                        "start_offset": 0,
                        "end_offset": 4,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70 73]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "stemmer_filter",
                "tokens": [
                    {
                        "token": "tap",
                        "start_offset": 0,
                        "end_offset": 4,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70]",
                        "keyword": false,
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            }
        ]
    }
}

Yes, I agree. Based on the analyze API output I do not see why it would ever match, even with fuzziness enabled.

When you get a match containing taps, what is the full content of the document? Is there some content elsewhere that could be matching?

It would help if you could provide a full document that is returned incorrectly together with full mappings so the issue can be reproduced locally.

You could also run your query while querying a single field at a time to see which one is matching. This may help narrow it down. Note that fuzziness may edit any position, so matching words do not even need to start with ta; words like cable or able may end up matching as well.
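The per-field narrowing-down idea could be sketched like this (a sketch only; the field list is copied from the query earlier in the thread, and each body would be sent as a separate search request against the index):

```python
# Build one fuzzy single-field query per field; running each separately
# and noting which fields return hits shows where the unexpected
# 'taps' match is coming from.

FIELDS = ["title", "keywords", "description", "attributes",
          "other_information", "tagged_services.name", "category_id.name",
          "sub_category_id.name", "sub_sub_category_id.name"]

def per_field_queries(query_text):
    return {
        field: {
            "query": {
                "match": {
                    field: {"query": query_text, "fuzziness": "AUTO"}
                }
            }
        }
        for field in FIELDS
    }
```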

Is there a way to avoid results containing cable or able when table is searched?

Not that I am aware of, at least not in the clause itself. Fuzzy matching does not allow you to control which parts of a string may be edited.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.