Fuzzy search doesn't return all available results

Hi,

I've been trying to use fuzziness to find typos of specific terms.
My terms usually consist of a single word.
Although I know there are thousands of matching terms within an edit distance of 1 or 2, only a few hundred are returned.
I've set max_expansions to 10,000, but it made no difference.

I've tried both the match query with the fuzziness option and the fuzzy query.
I've even tried changing the mapping, replacing the text field with a keyword field.
Nothing seems to work. The numbers are still low.

In addition, it looks as if the simplest matches (edit distance of 1) are not returned, while others with a higher distance are.

Since I need to run complex searches that rely on fuzziness, the suggesters feature wouldn't be enough for me.

What am I missing?

I would appreciate your help.
Thanks in advance,
Elad.

Hi,
Can you share the query, the response, and some example documents that do not hit?

Sure Tomo

Tried the following:

{"from":0,"size":1000,"query":{"match":{"entityId":{"query":"babyliss","fuzziness":2,"max_expansions":1000}}}}

{"from":0,"size":1000,"query":{"fuzzy":{"entityId":{"value":"babyliss","fuzziness":2, "max_expansions":1000}}}}

{"from":0,"size":1000,"query":{"fuzzy":{"entityId":{"value":"babyliss","fuzziness":"AUTO", "max_expansions":1000}}}}

Did not hit:

bebyliss
babylyss

Did hit:
bebyliss-AI
babylyssGM

Thanks,
Elad.

Thanks,

The result is a bit odd, because the Levenshtein distances from "babyliss" are:
"babyliss": 0
"babylyss": 1
"bebyliss-AI": 4
"babylyssGM": 3
Are the "Did hit" and "Did not hit" lists swapped?
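
A quick way to double-check distances like these is a small script. This is a minimal sketch of the plain Levenshtein distance (insertions, deletions, substitutions), not the code Elasticsearch runs; the function name is my own. As far as I know, Elasticsearch also counts a swap of two adjacent characters as a single edit by default, which makes no difference for these particular terms.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a matching character
            ))
        prev = curr
    return prev[-1]


query = "babyliss"
terms = ["bebyliss", "babylyss", "bebyliss-AI", "babylyssGM"]
distances = {term: levenshtein(query, term) for term in terms}
for term, dist in distances.items():
    print(f"{term}: {dist}")
```

The comparison is case-sensitive and treats the whole string as one term, which is how a keyword field without a normalizer would see it. Under this definition "bebyliss-AI" needs one substitution plus three insertions.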

Also, if the field type is text, I recommend using keyword instead, because the analyzer's behaviour can distort the Levenshtein distance and cause unintended results.

I got the desired result as follows.

PUT test_fuzziness
{
  "mappings":{
    "properties": {
      "entityId": {"type": "keyword"}
    }
  }
}

POST _bulk
{"index":{"_index":"test_fuzziness", "_id":0}}
{"entityId": "bebyliss"}
{"index":{"_index":"test_fuzziness"}, "_id":1}
{"entityId": "babylyss"}
{"index":{"_index":"test_fuzziness"}, "_id":2}
{"entityId": "babyliss-AI"}
{"index":{"_index":"test_fuzziness"}, "_id":3}
{"entityId": "babylyssGM"}

POST test_fuzziness/_search
{"from":0,"size":1000,"query":{"fuzzy":{"entityId":{"value":"babyliss","fuzziness":2, "max_expansions":1000}}}}
{
  "took" : 11,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0534762,
    "hits" : [
      {
        "_index" : "test_fuzziness",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 1.0534762,
        "_source" : {
          "entityId" : "bebyliss"
        }
      },
      {
        "_index" : "test_fuzziness",
        "_type" : "_doc",
        "_id" : "7zYu930BOSBwc1If6xfj",
        "_score" : 1.0534762,
        "_source" : {
          "entityId" : "babylyss"
        }
      }
    ]
  }
}

Thanks Tomo for your effort.

That is interesting.

I have an index with 500M+ documents. Maybe that has something to do with it?

Another thing: many of my items include a dot inside the word. Do you think that is an issue?

What is your field type? Is it text?
If it is text, what analyzer settings do you use?
The behaviour may depend on the analyzer.

Please share the index settings and mappings, if possible.

Hi Tomo,

I tried both (keyword and text).
The text field was added without any analyzer setting.

{
  "brands": {
    "mappings": {
      "brandentity": {
        "properties": {
          "brandName": {
            "type": "keyword"
          },
          "entityId": {
            "type": "text"
          },
          "isActive": {
            "type": "boolean"
          }
        }
      }
    }
  }
}

Many entity IDs are domain names and therefore include the dot and TLD of the domain, e.g. babyliss.com (as domain names, they are usually single-word phrases).

Thanks,
Elad.

As you say, domain names are usually tokenized as one word. In my environment, the query returns the desired results even with the "text" field type. One possible experiment is to raise "max_expansions" far above 1000, for example to 1,000,000, because the number of strings within Levenshtein distance 2 of a 12-character word can approach 1,000,000 or more (two of the 95 printable ASCII characters inserted into 13 positions: 95^2 * 13 * 14 / 2 = 821,275). If you query longer domains, there could be many more candidate terms. I tried this in my environment and it returned the same results.

I couldn't reproduce your result. Unfortunately, it seems to be beyond what I can handle. Sorry.
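
The 821,275 figure can be reproduced with a few lines of arithmetic. This sketch assumes the figure counts two characters, each drawn from the 95 printable ASCII characters, inserted into the 13 gaps around a 12-character word, with both allowed to share a gap. It covers double insertions only, and some of the resulting strings coincide, so it is a rough order-of-magnitude estimate rather than an exact count. The constant names are illustrative.

```python
from math import comb

ALPHABET_SIZE = 95  # printable ASCII characters
WORD_LENGTH = 12    # length of the hypothetical query term

# A 12-character word has 13 gaps (before, between and after its characters).
# Choosing 2 gaps with repetition allowed gives C(14, 2) = 13 * 14 / 2 = 91.
gap_choices = comb(WORD_LENGTH + 2, 2)

# Each of the two inserted characters can be any of the 95 characters.
candidates = ALPHABET_SIZE ** 2 * gap_choices
print(candidates)
```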

Many thanks Tomo.
Your help is much appreciated.

I tried changing max_expansions and was left with the same results.
I will try removing the dots and maybe splitting the phrases.

Thanks,
Elad.

Check out the Explain API for low-level details on why things do or don’t match.

What you’ll probably find is that the longer values you listed as matches, like bebyliss_GM, are split into two words and therefore match on the bebyliss token alone.
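
To illustrate the splitting, here is a rough Python stand-in for what a default text field does to these values. It is an approximation under stated assumptions, not the real analyzer: as far as I know, the standard analyzer follows Unicode word-boundary rules, which break on a hyphen but keep a dot between two letters inside the token, and it lowercases the result. The pattern and function name are mine; the _analyze API will show the tokens your index really produces.

```python
import re

# Assumed approximation of the standard analyzer: runs of ASCII letters,
# digits and underscores form a token, a dot between two such runs does
# not split the token, and every token is lowercased.
TOKEN_PATTERN = re.compile(r"\w+(?:\.\w+)*", re.ASCII)


def approx_standard_tokens(text: str) -> list:
    """Return the approximate tokens a default text field would index."""
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


for value in ["bebyliss-AI", "babyliss.com"]:
    print(value, "->", approx_standard_tokens(value))
```

If that approximation holds, the hyphenated value contributes a standalone "bebyliss" token that sits one edit away from "babyliss", while a domain stays a single longer token whose suffix counts against the edit distance.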

If there are thousands or millions of word variations to consider, it might make sense to do all your fuzzing at index time rather than relying on expensive query-time Levenshtein edit distance comparisons. An analyzer that uses “ngrams” of small sizes, e.g. 3 and 4, will chop document and search strings into smaller pieces and rank highest those docs that have the most substrings in common, e.g. byl, byli, lis, liss. So we do straight matching on fragments of words rather than expensive fuzzy queries comparing whole words.
It costs more disk and IO but less CPU, and could give you better recall.
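
The idea can be sketched outside Elasticsearch in a few lines of Python. This only illustrates the overlap counting, assuming n-gram sizes 3 and 4; it is not how Elasticsearch scores documents, and in a real index you would configure an ngram tokenizer or token filter in the analyzer instead. The function names and the "unrelated" document are made up for the example.

```python
def char_ngrams(text: str, sizes=(3, 4)) -> set:
    """All character n-grams of the given sizes, lowercased."""
    text = text.lower()
    return {text[i:i + n] for n in sizes for i in range(len(text) - n + 1)}


def shared_ngrams(query: str, doc: str) -> set:
    """Fragments that the query and the document have in common."""
    return char_ngrams(query) & char_ngrams(doc)


query = "babyliss"
docs = ["bebyliss", "babylyss", "unrelated"]

# Rank documents by how many fragments they share with the query.
ranked = sorted(docs, key=lambda d: len(shared_ngrams(query, d)), reverse=True)
for doc in ranked:
    print(doc, sorted(shared_ngrams(query, doc)))
```

A single typo only disturbs the fragments that overlap it, so the misspelled variants still share several fragments with the query, while unrelated terms share none.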

Hi Mark,

I wasn't aware of the Explain API, so thank you; I will try it.
Regarding the ngrams, I was hoping to avoid them, but if there is no other way, I will try them.

Thanks.

Update:

The issue was the dot in the phrases.
Once I removed the dots, the terms were returned as the documentation describes.
I'm still not sure why fuzziness doesn't work correctly on single-word terms with dots, but that was the cause.

Thanks Tomo and Mark for your help.
Elad.
