Handling misspelled words / typos in Elasticsearch without fuzziness

Hi All,
We are working on an ecommerce product built with Next.js and a Python API. We make Elasticsearch REST API calls from the React frontend.

We are facing the following problem:
Spelling mistakes need to be handled without fuzziness.
For example: flour tiels should only match floor tiles. It should not match the items below:

  • floor mats
  • wooden floor
  • floor carpet
  • vitrified tiles
  • wall tiles
  • Kitchen tiles

So we are unable to use fuzziness here. We have also tried the nGram option, but it was not helpful.

Any other suggestion to handle this would be a great help for us. Thank you all!

This is typically what fuzziness is used for. Why are you not able to use it here? What is the mapping of the field(s) you are querying?

Hi @Christian_Dahlqvist, I have just updated the question; there were some wording mistakes.

For example: flour tiels should only match floor tiles. It should not match the items below:

  • floor mats
  • wooden floor
  • floor carpet
  • vitrified tiles
  • wall tiles
  • Kitchen tiles

Fuzziness is matching the above items in our case!

That is probably because of the mapping of the field(s) queried and/or the query. Can you please share these?

I think you might be confusing fuzzy matching (also called "typo tolerance") with phrase matching, which allows you to get lower-scored documents when fewer terms (but still some terms!) match.

In the examples you gave, I'd expect Floor tiles to be the top result, but the others would still show up in your result set with lower scores. This is because each of them has one term that fuzzy-matches one of the query terms. But since Floor tiles has two terms that fuzzy-match the two query terms, it should be the top result.

Side note - if you're finding yourself confused or overwhelmed by all the moving pieces of building a relevant search solution on top of bare Elasticsearch, I suggest you look at Elastic App Search, which abstracts away a lot of the low-level details and can help you set up an end-user search experience that mimics a lot of the relevance patterns we have come to expect from tools like Google.

Hi @Christian_Dahlqvist,
Here is the query

Request Query

{
    "_source": [
        "title",
        "keywords",
        "description",
        "other_information"
    ],
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "flour tiels",
                        "fields": [
                            "title^4",
                            "keywords^3",
                            "description^2",
                            "other_information^1"
                        ],
                        "fuzziness": "AUTO"
                    }
                }
            ]
        }
    }
}

Document Mapping

mappings = {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "synonym_stemmer_bad_words_analyzer",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        },
        "keywords": {
            "type": "text",
            "analyzer": "synonym_stemmer_bad_words_analyzer",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        },
        "description": {
            "type": "text",
            "analyzer": "synonym_stemmer_bad_words_analyzer",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        },
        "other_information": {
            "type": "text",
            "analyzer": "synonym_stemmer_bad_words_analyzer",
            "fields": {
                "keyword": {
                    "type": "keyword"
                }
            }
        }
    }
}

As the query is structured at the moment I believe it will match any document that contains either term in the query, which is why you get a lot of matches that you do not want. If you require both terms to match (fuzzy or not fuzzy) you need to change the query.

Are you expecting the two search terms to be in the same field, or can they appear in different fields? If it is the former, you can use an and operator in your multi_match query clause. If it is the latter, you could instead look at the cross_fields query type.
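For what it's worth, the two variants could be sketched as query bodies built in Python (a sketch only; the field list and boosts are copied from the query above, and as far as I know the cross_fields type does not support the fuzziness parameter, so it is omitted there):

```python
# Sketch of the two variants, built as plain dicts.
# Field names and boosts are taken from the query earlier in the thread.

FIELDS = ["title^4", "keywords^3", "description^2", "other_information^1"]

def both_terms_same_field(query_text):
    # "operator": "and" requires every term to match within a single field.
    return {
        "query": {
            "bool": {
                "must": [{
                    "multi_match": {
                        "query": query_text,
                        "fields": FIELDS,
                        "operator": "and",
                        "fuzziness": "AUTO",
                    }
                }]
            }
        }
    }

def terms_across_fields(query_text):
    # "cross_fields" treats the fields as one combined field, so the two
    # terms may each match in a different field. No fuzziness here.
    return {
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": FIELDS,
                "type": "cross_fields",
                "operator": "and",
            }
        }
    }
```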

OK, let me try that @Christian_Dahlqvist, thanks. I'll update once tested!

Hi @Christian_Dahlqvist,
We used "type": "best_fields", "operator": "and"; this works for about 80% of our expectations. But when we search for the misspelled tabel, we don't get any results because fuzziness is disabled.

When fuzziness is enabled, we get taps and other products like dinning table and table top, which should not match in our case. Taps is matched because ta matches the searched keyword.

Do we need to make 2 calls, first without fuzziness and then with fuzziness if no results are found? Or can we make a single call that keeps fuzziness optional and applies it automatically only when no results are found? Or is there any other solution?

{
    "from": 0,
    "size": 10000,
    "query": {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": "table",
                        "fields": [
                            "title^9",
                            "keywords^8",
                            "description^7",
                            "attributes^6",
                            "other_information^5",
                            "tagged_services.name^4",
                            "category_id.name^3",
                            "sub_category_id.name^2",
                            "sub_sub_category_id.name^1"
                        ],
                        "type": "best_fields",
                        "operator": "and"
                        // "fuzziness": "AUTO"
                    }
                }
            ]
        }
    }
}

Why would dinning table and table top not match if you search for just tabel?

The reason taps would match when you search for tabel is probably your custom analyzer, which includes stemming. AUTO fuzziness for a 5-character search string only allows an edit distance of 1. If the stemmer generates even shorter tokens this may drop to 0 and require an exact match. Without the stemmer in place I suspect you would not see this.

You can use the analyze API to compare how 'taps' and 'tabel' are tokenised. If you look at the length of the tokens generated and compare this to the AUTO fuzziness length table I think you will see why you are getting the match.
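To illustrate the arithmetic: AUTO fuzziness allows 0 edits for terms of 0-2 characters, 1 edit for 3-5 characters, and 2 edits for 6 or more. A quick plain-Levenshtein sketch (note that Elasticsearch actually uses Damerau-Levenshtein by default, where a transposition counts as one edit, so this slightly overstates distances involving swapped letters):

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]

def auto_fuzziness(term):
    # Edit distance AUTO allows for a term of this length.
    n = len(term)
    return 0 if n <= 2 else 1 if n <= 5 else 2

# 'tabel' keeps 5 characters after stemming, so AUTO allows 1 edit,
# while 'tabel' -> 'tap' needs 3 edits: no fuzzy match expected here.
```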

You can not have that kind of conditional logic within a query but you can send the 2 requests in parallel in a single call using the multi search API.
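Such a fallback could be sketched roughly like this (hypothetical index name; the _msearch endpoint takes newline-delimited JSON with alternating header and body lines):

```python
import json

def build_msearch_body(query_text, index="products"):
    # Two searches in one _msearch request: exact first, fuzzy second.
    # The client can prefer the exact hits and fall back to the fuzzy ones.
    fields = ["title^4", "keywords^3", "description^2", "other_information^1"]

    def search(fuzziness=None):
        mm = {"query": query_text, "fields": fields, "operator": "and"}
        if fuzziness:
            mm["fuzziness"] = fuzziness
        return {"query": {"multi_match": mm}}

    lines = []
    for body in (search(), search("AUTO")):
        lines.append(json.dumps({"index": index}))  # header line
        lines.append(json.dumps(body))              # body line
    return "\n".join(lines) + "\n"  # _msearch requires a trailing newline
```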

Yes, you are correct, that could be considered! But taps is not expected!

What does the analyze API give? If taps stems to tap and table stems to tab (just examples), you will be searching with a 3-letter search string that allows an edit distance of 1. The difference between the two stemmed strings is 1 edit, so they match.

The analyze results are:

  • table stems to tabl
  • tabel stems to tabel
  • tap stems to tap
  • taps stems to tap

Query: Table

{
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "synonym_filter",
        "bad_words_filter",
        "stemmer_filter"
    ],
    "text": [
        "table"
    ],
    "explain" : true
}

Response:

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "standard",
            "tokens": [
                {
                    "token": "table",
                    "start_offset": 0,
                    "end_offset": 5,
                    "type": "<ALPHANUM>",
                    "position": 0,
                    "bytes": "[74 61 62 6c 65]",
                    "positionLength": 1,
                    "termFrequency": 1
                }
            ]
        },
        "tokenfilters": [
            {
                "name": "lowercase",
                "tokens": [
                    {
                        "token": "table",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 6c 65]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "synonym_filter",
                "tokens": [
                    {
                        "token": "table",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 6c 65]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "bad_words_filter",
                "tokens": [
                    {
                        "token": "table",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 6c 65]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "stemmer_filter",
                "tokens": [
                    {
                        "token": "tabl",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 6c]",
                        "keyword": false,
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            }
        ]
    }
}

Query: Tabel

{
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "synonym_filter",
        "bad_words_filter",
        "stemmer_filter"
    ],
    "text": [
        "tabel"
    ],
    "explain" : true
}

Response:

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "standard",
            "tokens": [
                {
                    "token": "tabel",
                    "start_offset": 0,
                    "end_offset": 5,
                    "type": "<ALPHANUM>",
                    "position": 0,
                    "bytes": "[74 61 62 65 6c]",
                    "positionLength": 1,
                    "termFrequency": 1
                }
            ]
        },
        "tokenfilters": [
            {
                "name": "lowercase",
                "tokens": [
                    {
                        "token": "tabel",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 65 6c]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "synonym_filter",
                "tokens": [
                    {
                        "token": "tabel",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 65 6c]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "bad_words_filter",
                "tokens": [
                    {
                        "token": "tabel",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 65 6c]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "stemmer_filter",
                "tokens": [
                    {
                        "token": "tabel",
                        "start_offset": 0,
                        "end_offset": 5,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 62 65 6c]",
                        "keyword": false,
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            }
        ]
    }
}

Query: Tap

{
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "synonym_filter",
        "bad_words_filter",
        "stemmer_filter"
    ],
    "text": [
        "tap"
    ],
    "explain" : true
}

Response:

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "standard",
            "tokens": [
                {
                    "token": "tap",
                    "start_offset": 0,
                    "end_offset": 3,
                    "type": "<ALPHANUM>",
                    "position": 0,
                    "bytes": "[74 61 70]",
                    "positionLength": 1,
                    "termFrequency": 1
                }
            ]
        },
        "tokenfilters": [
            {
                "name": "lowercase",
                "tokens": [
                    {
                        "token": "tap",
                        "start_offset": 0,
                        "end_offset": 3,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "synonym_filter",
                "tokens": [
                    {
                        "token": "tap",
                        "start_offset": 0,
                        "end_offset": 3,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "bad_words_filter",
                "tokens": [
                    {
                        "token": "tap",
                        "start_offset": 0,
                        "end_offset": 3,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "stemmer_filter",
                "tokens": [
                    {
                        "token": "tap",
                        "start_offset": 0,
                        "end_offset": 3,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70]",
                        "keyword": false,
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            }
        ]
    }
}

Query: Taps

{
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        "synonym_filter",
        "bad_words_filter",
        "stemmer_filter"
    ],
    "text": [
        "taps"
    ],
    "explain" : true
}

Response:

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "standard",
            "tokens": [
                {
                    "token": "taps",
                    "start_offset": 0,
                    "end_offset": 4,
                    "type": "<ALPHANUM>",
                    "position": 0,
                    "bytes": "[74 61 70 73]",
                    "positionLength": 1,
                    "termFrequency": 1
                }
            ]
        },
        "tokenfilters": [
            {
                "name": "lowercase",
                "tokens": [
                    {
                        "token": "taps",
                        "start_offset": 0,
                        "end_offset": 4,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70 73]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "synonym_filter",
                "tokens": [
                    {
                        "token": "taps",
                        "start_offset": 0,
                        "end_offset": 4,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70 73]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "bad_words_filter",
                "tokens": [
                    {
                        "token": "taps",
                        "start_offset": 0,
                        "end_offset": 4,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70 73]",
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            },
            {
                "name": "stemmer_filter",
                "tokens": [
                    {
                        "token": "tap",
                        "start_offset": 0,
                        "end_offset": 4,
                        "type": "<ALPHANUM>",
                        "position": 0,
                        "bytes": "[74 61 70]",
                        "keyword": false,
                        "positionLength": 1,
                        "termFrequency": 1
                    }
                ]
            }
        ]
    }
}

Yes, I agree. Based on the analyze API output I do not see why it would ever match, even with fuzziness enabled.

When you get a match containing taps, what is the full content of the document? Is there some content elsewhere that could be matching?

It would help if you could provide a full document that is returned incorrectly together with full mappings so the issue can be reproduced locally.

You could also run your query while querying a single field at a time to see which one is matching. This may help narrow it down. Note that fuzziness may edit any position, so matching words do not even need to start with ta; words like cable or able may end up matching as well.
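The per-field narrowing-down idea could be sketched like this (a sketch only; the field list is copied from the query earlier in the thread, and each body would be sent as a separate search request against the index):

```python
# Build one fuzzy single-field query per field; running each separately
# and noting which fields return hits shows where the unexpected
# 'taps' match is coming from.

FIELDS = ["title", "keywords", "description", "attributes",
          "other_information", "tagged_services.name", "category_id.name",
          "sub_category_id.name", "sub_sub_category_id.name"]

def per_field_queries(query_text):
    return {
        field: {
            "query": {
                "match": {
                    field: {"query": query_text, "fuzziness": "AUTO"}
                }
            }
        }
        for field in FIELDS
    }
```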

Is there a way to avoid results containing cable or able when table is searched?

Not that I am aware of, at least not in the clause itself. Fuzzy matching does not allow you to control which parts of a string may be edited.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.