Try indexing the field with an nGram analyzer. It generally yields better results than fuzzy. Set min_gram to the smallest length of your strings and max_gram to the largest length of your strings.
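As a rough illustration of that suggestion, here is a sketch of index settings built as a Python dict. It assumes a recent Elasticsearch version (7.x or later, where `index.max_ngram_diff` must be raised when max_gram and min_gram differ by more than 1). The field name `name`, the analyzer and tokenizer names, and the gram sizes 2 and 10 are all made up for the example; they are not from this thread.

```python
import json

# Hypothetical gram sizes: shortest and longest string you expect to match.
MIN_GRAM = 2
MAX_GRAM = 10

# Request body for creating an index (assumed names throughout).
index_body = {
    "settings": {
        # Needed in recent versions when max_gram - min_gram > 1.
        "index": {"max_ngram_diff": MAX_GRAM - MIN_GRAM},
        "analysis": {
            "tokenizer": {
                "name_ngram_tokenizer": {
                    "type": "ngram",
                    "min_gram": MIN_GRAM,
                    "max_gram": MAX_GRAM,
                    # Split on anything that is not a letter or digit.
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "name_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "name_ngram_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        },
    },
    "mappings": {
        "properties": {
            "name": {"type": "text", "analyzer": "name_ngram_analyzer"}
        }
    },
}

# Print the JSON you would send when creating the index.
print(json.dumps(index_body, indent=2))
```

Older versions use a different mapping layout (with mapping types) and have no `max_ngram_diff` setting, so adjust accordingly.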
But doesn't fuzzy search match any and all strings within a particular Damerau-Levenshtein distance, like matching SCOOL with SCHOOL if the edit distance is set to 1, and not matching COOL with SCHOOL as that requires an edit distance of 2? Or is there some limitation of fuzzy search in ES that I am not aware of? Please let me know if such a limitation exists.
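The distances quoted above can be checked with a small script. This is only a sketch of the restricted Damerau-Levenshtein distance (the "optimal string alignment" variant); it is not Elasticsearch's own implementation, and the function name `osa_distance` is made up for the example.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insert, delete, substitute and
    adjacent transposition as one edit each (optimal string alignment)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


print(osa_distance("SCOOL", "SCHOOL"))  # 1: insert H
print(osa_distance("COOL", "SCHOOL"))   # 2: insert S and H
```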
Also, as far as I know, ngram tokenizers are more useful for languages with very long compound words. As per the official Elasticsearch site:
"The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length."
Also, in my case min_gram would be 1 or 2 and max_gram would go up to 10-15. This would produce a lot of tokens for each indexed field. What are your views on this?
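To put a number on "a lot of tokens", here is a back-of-the-envelope sketch. It assumes a single word with no split characters, where a word of length L yields L - n + 1 grams of each size n; the helper name `ngram_count` and the 15-character word length are made up for the example.

```python
def ngram_count(word_length: int, min_gram: int, max_gram: int) -> int:
    """Number of n-grams emitted for one word of the given length."""
    total = 0
    for n in range(min_gram, max_gram + 1):
        if n <= word_length:
            # A sliding window of size n fits in word_length - n + 1 positions.
            total += word_length - n + 1
    return total


# A hypothetical 15-character word:
print(ngram_count(15, 1, 15))  # 120 tokens with min_gram=1, max_gram=15
print(ngram_count(15, 2, 10))  # 90 tokens with min_gram=2, max_gram=10
```

So a single 15-character word turns into roughly a hundred tokens with these ranges, which is where the extra index size comes from.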
OK, can you tell me what you mean by the "offset" you are talking about and how it affects fuzzy searches? I didn't know such a thing existed. Also, I have edited my earlier reply; can you have a look at it?
Take "FOOTWEAR" and "FOOTWEAT": nGrams will match here.
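A quick way to see why the two strings match under nGrams is to list the grams they share. This is just a sketch in plain Python, not what Elasticsearch runs internally; the gram sizes 2 and 3 and the helper name `ngrams` are assumptions for the example.

```python
def ngrams(word: str, min_gram: int, max_gram: int) -> set:
    """All distinct substrings of word with length min_gram..max_gram."""
    grams = set()
    for n in range(min_gram, max_gram + 1):
        for start in range(len(word) - n + 1):
            grams.add(word[start:start + n])
    return grams


indexed = ngrams("FOOTWEAR", 2, 3)
queried = ngrams("FOOTWEAT", 2, 3)
shared = indexed & queried

print(sorted(shared))                  # grams present in both strings
print(len(shared), "of", len(queried), "query grams match")
print(sorted(queried - indexed))       # only the grams touching the typo differ
```

Most of the query's grams are found in the indexed term, so the document still matches even though the last character is wrong.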
I also noticed you are setting a boost of 0.1f, which penalizes matches. Did you really want that? To boost, use a value greater than 1; a value less than 1 lowers the score.
nGrams will produce a lot of tokens and take up quite a bit of disk space too. But your queries will run faster than wildcard or fuzzy queries.
OK, I get why ngrams would be faster, but I am sorry, I still don't get why SCOOL will not match SCHOOL, as insertion of an H should be allowed in fuzzy. Also, since my highest-relevance fields have a boost of 0.9f, I have adjusted the boosts of the other fields accordingly so that they don't get more relevancy than that one. And why is it punishing to use a boost of 0.1f if all my other fields also use boosts in a similar range?
If you want to boost a certain result, use a boost greater than 1; if you want to lower its relevancy, use a boost between 0 and 1. Please read the documentation regarding this.