I've been trying to use fuzziness in order to find typos for specific terms.
My terms usually consist of a single word.
Although there are thousands of possible results that I'm aware of (fuzziness of 1 or 2), only a few hundred return.
I've set max_expansions to 10,000, but still no progress.
I've tried both match query + fuzziness option and fuzzy query.
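For reference, the two query shapes I tried look roughly like this (shown as Python dicts for the request bodies; the index and field names here are placeholders):

```python
# Hypothetical request bodies; "name" is a placeholder field name.

# 1) match query with the fuzziness option
match_query = {
    "query": {
        "match": {
            "name": {
                "query": "babyliss",
                "fuzziness": "AUTO",      # or an explicit 1 / 2
                "max_expansions": 10000,  # raised from the default of 50
            }
        }
    }
}

# 2) dedicated fuzzy (term-level) query
fuzzy_query = {
    "query": {
        "fuzzy": {
            "name": {
                "value": "babyliss",
                "fuzziness": 2,
                "max_expansions": 10000,
            }
        }
    }
}
```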
I've even tried changing the mapping, replacing text field with keyword field.
Nothing seems to work. The numbers are still low.
In addition, it looks as if the simplest results (fuzziness = 1) don't return, while others with a higher distance do.
As I need to perform complex searches with the help of fuzziness, the suggestions feature wouldn't be enough for me.
What am I missing?
I would appreciate your help.
Thanks in advance,
Elad.
The result is a bit odd because the Levenshtein distances from "babyliss" are
"babyliss": 0
"babylyss": 1
"bebyliss-AI": 4
"babylyssGM": 3
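These distances can be double-checked with a few lines of Python:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (Wagner-Fischer)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

for term in ["babyliss", "babylyss", "bebyliss-AI", "babylyssGM"]:
    print(term, levenshtein("babyliss", term))
# babyliss 0, babylyss 1, bebyliss-AI 4, babylyssGM 3
```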
Did you perhaps swap which terms did and did not return?
And if the field type is text, I recommend using keyword, because the analyzer's behaviour distorts the Levenshtein distance and causes unintended results.
As you say, domain names are usually tokenized as one word; in my environment, it returns the desired results even with the "text" field type. One possible experiment is to raise "max_expansions" much higher, such as to 1,000,000, because the number of terms within Levenshtein distance 2 of a 12-character word can reach 1,000,000 or more (two insertions alone give 95² printable ASCII characters × 13 positions × 14 positions / 2 = 821,275 candidates). If you query longer domains, the candidate terms could be far more numerous. I tried this in my environment and it returned the same results.
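That back-of-the-envelope count can be reproduced directly (a toy calculation, counting only the candidates produced by two insertions):

```python
# Two insertions into a 12-character word: the first insertion has 13
# possible positions, the second 14, divided by 2 for ordering, and each
# inserted character can be any of the 95 printable ASCII characters.
ALPHABET = 95
WORD_LEN = 12
two_insertions = ALPHABET ** 2 * (WORD_LEN + 1) * (WORD_LEN + 2) // 2
print(two_insertions)  # 821275
```

Substitutions and deletions add further candidates on top of this, which is why the true count "could be much more".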
I couldn't reproduce your result. Unfortunately, it seems to be beyond what I can handle. Sorry.
Check out the Explain API for low-level details on why things do or don't match.
What you'll probably find is that the longer terms you suggested, like bebyliss_GM, are probably split into two tokens and therefore just match on the bebyliss token.
If there are thousands or millions of word variations you need to consider, it might make sense to do all your fuzzing at index time rather than paying for expensive query-time Levenshtein edit-distance comparisons. An analyzer that uses ngrams of small sizes, e.g. 3 and 4, will chop both document and search strings into smaller pieces and rank highest the docs that have the most substring values in common, e.g. byl, byli, lis, liss, etc. So we do straight matching on fragments of words rather than expensive fuzzy queries comparing whole words.
It costs more disk and IO but less CPU and could give you better recall.
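The idea can be sketched without Elasticsearch at all; splitting each word into 3- and 4-character fragments (a toy stand-in for an ngram token filter with min_gram 3 and max_gram 4, not the actual Lucene analyzer) turns matching into counting shared fragments:

```python
def ngrams(text: str, min_n: int = 3, max_n: int = 4) -> set[str]:
    """All character ngrams of length min_n..max_n of the given text."""
    return {text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)}

query = ngrams("babyliss")
typo = ngrams("babylyss")         # one typo
unrelated = ngrams("dishwasher")

# The typo variant shares many fragments with the query, the unrelated
# word shares none, so straight term matching on fragments naturally
# ranks the typo variant higher.
print(len(query & typo), len(query & unrelated))  # 5 0
```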
I wasn't aware of the Explain API, so thank you, I will try it.
Regarding ngrams, I was hoping to avoid them, but if there's no other way, I'll give them a try.
The issue was the dot in the phrases.
Whenever I removed the dots, the terms returned as documented.
I'm still not sure exactly why fuzziness doesn't work correctly on single-word terms containing dots, but that was the cause; perhaps with the dot the whole domain (e.g. babyliss.com) stays a single token, so its edit distance from the bare query term exceeds the fuzziness limit.