Fuzzy match query unexpected results


(Emil) #1

I want my match query to tolerate misspellings, so I added fuzziness as shown below. However, this completely changed the results compared to the same query without fuzziness.

I am using the following mapping and analyzers:

{
"state": "open",
"settings": {
"index": {
"creation_date": "1457443337681",
"analysis": {
"filter": {
"my_edge_ngram_analyzer": {
"type": "edgeNGram",
"min_gram": "2",
"max_gram": "10"
},
"my_word_delimiter": {
"catenate_all": "true",
"type": "word_delimiter"
}
},
"analyzer": {
"my_analyzer": {
"filter": [
"standard"
,
"lowercase"
,
"my_word_delimiter"
,
"my_edge_ngram_analyzer"
],
"type": "custom",
"tokenizer": "whitespace"
}
}
},
"number_of_shards": "5",
"number_of_replicas": "1",
"version": {
"created": "2020099"
}
}
},
"Name": {
"search_analyzer": "standard",
"analyzer": "my_analyzer",
"type": "string"
},
"ShortDescription": {
"search_analyzer": "standard",
"analyzer": "my_analyzer",
"type": "string"
}
}
},

Here is the query without fuzziness:

{
   "query": {
      "bool": {
         "should": [
            {
               "multi_match": {
                  "type": "best_fields",
                  "query": "hp 301",
                  "fields": [
                     "Name^7",
                     "ShortDescription^6"
                  ]
               }
            }
         ]
      }
   }
}

As expected, this query returns the most relevant results for hp 301:

 "_source": {
               "id": 1,
               "Name": "l HP CH561EE / 301 Black",
               "ShortDescription": "301  
   "_source": {
               "id": 2,
               "Name": " HP E5Y87EE / 301 Set (2 x Black)",
              "ShortDescription": "301  

I expect the same results when I use fuzziness. As I understand it, fuzziness should only correct misspellings, not change the query results.
With fuzziness set to AUTO and prefix_length 0, the query is:

{
   "query": {
      "bool": {
         "should": [
            {
               "multi_match": {
                  "type": "best_fields",
                  "query": "hp 301",
                  "fuzziness":"AUTO",
                  "prefix_length":0,
                  "fields": [
                     "Name^7",
                     "ShortDescription^6"
                  ]
               }
            }
         ]
      }
   }
}

The results below are totally irrelevant; the only thing they share with the query is HP. How do they get the highest score?

     "_source": {
           "id": 123,
           "Name": "HP CE411A / 305A Cyan",
           "ShortDescription": "305A",
   "_source": {
           "id": 1234,
           "Name": "HP CC530A bis CC533A Set",
           "ShortDescription": "304A",

Even more dramatic: when I set fuzziness to 2 instead of AUTO, I get results that make no sense. Why would I get the second one, which contains neither hp nor 301?

  "_source": {
               "id": 345,
               "Name": "Utax 4401410015 Black",      
               "ShortDescription": "LP3014",
 "_source": {
               "id": 3400,
               "Name": "Konica Minolta 8936-404 / EP302B Black",      
               "ShortDescription": "EP302B",

Furthermore, when I use "fuzziness":2, "prefix_length":1 in the same query, I get different results:

 "_source": {
               "id": 778,
               "Name": "593-10122 / HG308 Yellow",             
               "ShortDescription": "HG308",

"fuzziness":"AUTO", "prefix_length":1 also gives different results:

  "_source": {
               "id": 8990,
               "Name": "C 13 S0 53021 / 3021",         
               "ShortDescription": "3021",

Can somebody explain what I am doing wrong? Am I misunderstanding fuzziness?


(Jimferenczi) #2

The match query first analyzes your query string ("hp 301", for instance), which produces the terms that the query must match. If fuzziness is involved, every indexed term within a Levenshtein distance of N from a query term will match; for instance, the term "hp" matches "ep" or "lp".
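
To make the distance idea concrete, here is a minimal Python sketch of plain Levenshtein distance applied to terms taken from the results in this thread. The function name `levenshtein` and the chosen pairs are illustrative only; Lucene's fuzzy matching is implemented differently internally, so treat this as an approximation of the concept rather than of the engine.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


# Query terms from "hp 301" paired with terms seen in the unexpected hits.
pairs = [("hp", "ep"), ("hp", "lp"), ("301", "302"), ("301", "308"), ("301", "3021")]
for query_term, indexed_term in pairs:
    print(query_term, indexed_term, levenshtein(query_term, indexed_term))
```

Every pair above is a single edit apart, so each of these indexed terms is within reach of a fuzzy match once any edit is allowed.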
The bottom line is that fuzziness on a field analyzed with edge n-grams (especially with a minimum gram length of 2) is not recommended.
You should use a suggester instead: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html
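
To illustrate why edge n-grams make fuzzy matching so permissive, the following Python sketch expands a few tokens into prefixes using the min_gram and max_gram values from the mapping above. The helper `edge_ngrams` and the sample input tokens are assumptions made for illustration; the whitespace tokenizer and word_delimiter steps are not modelled, so the real index may contain additional or different terms.

```python
def edge_ngrams(token: str, min_gram: int = 2, max_gram: int = 10) -> list:
    """Return the prefixes of `token` with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


# Hypothetical lowercased tokens, based on ShortDescription values in the thread.
for token in ["305a", "3014", "ep302b"]:
    print(token, edge_ngrams(token))
```

Under these assumptions, "3014" is indexed with the prefix "301" (an exact match for the query term), "305a" with the prefix "30" (one deletion away from "301"), and "ep302b" with the prefix "ep" (one substitution away from "hp"). Short prefixes like these are close to many query terms, which is why so many unrelated documents can match.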


(Emil) #4

What kind of suggester can I use to handle misspellings? Can you give me a hint, please?

