I've been trying to use fuzziness in order to find typos for specific terms.
My terms usually consist of a single word.
Although there are thousands of possible results that I'm aware of (fuzziness of 1 or 2), only a few hundred return.
I've set max_expansions to 10,000, but still no progress.
I've tried both match query + fuzziness option and fuzzy query.
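For reference, the two query shapes I tried look roughly like this (shown as Python dicts for the request bodies; the index and field names here are placeholders):

```python
# Hypothetical request bodies; "name" is a placeholder field name.

# 1) match query with the fuzziness option
match_query = {
    "query": {
        "match": {
            "name": {
                "query": "babyliss",
                "fuzziness": "AUTO",      # or an explicit 1 / 2
                "max_expansions": 10000,  # raised from the default of 50
            }
        }
    }
}

# 2) dedicated fuzzy (term-level) query
fuzzy_query = {
    "query": {
        "fuzzy": {
            "name": {
                "value": "babyliss",
                "fuzziness": 2,
                "max_expansions": 10000,
            }
        }
    }
}
```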
I've even tried changing the mapping, replacing text field with keyword field.
Nothing seems to work. The numbers are still low.
In addition, it looks as if the simplest results (fuzziness = 1) don't return, while others with a higher distance do.
As I need to perform complex searches with the help of fuzziness, the suggestions feature wouldn't be enough for me.
What am I missing?
I would appreciate your help.
Thanks in advance,
Elad.
The result is a bit odd because the Levenshtein distances from "babyliss" are
"babyliss": 0
"babylyss": 1
"bebyliss-AI": 4
"babylyssGM": 3
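These distances can be double-checked with a few lines of Python:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (Wagner-Fischer)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

for term in ["babyliss", "babylyss", "bebyliss-AI", "babylyssGM"]:
    print(term, levenshtein("babyliss", term))
# babyliss 0, babylyss 1, bebyliss-AI 4, babylyssGM 3
```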
Did you perhaps swap which terms did and did not return?
And if the field type is text, I recommend using keyword, because the analyzer's behaviour distorts the Levenshtein distance and causes unintended results.
As you say, domain names are usually tokenized as one word; in my environment, it returns the desired results even with the "text" field type. One possible experiment is to raise "max_expansions" much higher, such as to 1,000,000, because the number of terms within Levenshtein distance 2 of a 12-character word can reach 1,000,000 or more (two insertions alone give 95² printable ASCII characters × 13 positions × 14 positions / 2 = 821,275 candidates). If you query longer domains, the candidate terms could be far more numerous. I tried this in my environment and it returned the same results.
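That back-of-the-envelope count can be reproduced directly (a toy calculation, counting only the candidates produced by two insertions):

```python
# Two insertions into a 12-character word: the first insertion has 13
# possible positions, the second 14, divided by 2 for ordering, and each
# inserted character can be any of the 95 printable ASCII characters.
ALPHABET = 95
WORD_LEN = 12
two_insertions = ALPHABET ** 2 * (WORD_LEN + 1) * (WORD_LEN + 2) // 2
print(two_insertions)  # 821275
```

Substitutions and deletions add further candidates on top of this, which is why the true count "could be much more".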
I couldn't reproduce your result. Unfortunately, it seems to be beyond what I can handle. Sorry.
Check out the Explain API for low-level details on why things do or don't match.
What you'll probably find is that the longer terms you suggested, like bebyliss_GM, are probably split into two tokens and therefore just match on the bebyliss token.
If there are thousands or millions of word variations you need to consider, it might make sense to do all your fuzzing at index time rather than paying for expensive query-time Levenshtein edit-distance comparisons. An analyzer that uses ngrams of small sizes, e.g. 3 and 4, will chop both document and search strings into smaller pieces and rank highest the docs that have the most substring values in common, e.g. byl, byli, lis, liss, etc. So we do straight matching on fragments of words rather than expensive fuzzy queries comparing whole words.
It costs more disk and IO but less CPU and could give you better recall.
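The idea can be sketched without Elasticsearch at all; splitting each word into 3- and 4-character fragments (a toy stand-in for an ngram token filter with min_gram 3 and max_gram 4, not the actual Lucene analyzer) turns matching into counting shared fragments:

```python
def ngrams(text: str, min_n: int = 3, max_n: int = 4) -> set[str]:
    """All character ngrams of length min_n..max_n of the given text."""
    return {text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)}

query = ngrams("babyliss")
typo = ngrams("babylyss")         # one typo
unrelated = ngrams("dishwasher")

# The typo variant shares many fragments with the query, the unrelated
# word shares none, so straight term matching on fragments naturally
# ranks the typo variant higher.
print(len(query & typo), len(query & unrelated))  # 5 0
```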
I wasn't aware of the Explain API, so thank you, I will try it.
Regarding ngrams, I was hoping to avoid them, but if there's no other way, I'll give them a try.
The issue was the dot in the phrases.
Whenever I removed the dots, the terms returned as documented.
I'm still not sure exactly why fuzziness doesn't work correctly on single-word terms containing dots, but that was the cause; perhaps with the dot the whole domain (e.g. babyliss.com) stays a single token, so its edit distance from the bare query term exceeds the fuzziness limit.