I'm trying to get fuzzy search working using the query_string query but I'm having a lot of issues. It seems like no matter what I set my fuzzy_min_sim value to, it makes very little difference in the results.
Here is a small example of my situation:
Why is it that the above won't return the correct result, even with a 0.01 fuzzy_min_sim value? As far as I understand, it isn't a term/token problem because I'm setting the mapping to be lowercase and as a single token. I've also tested this with just a simple string as the email field like abcdefghijklmnopqrstuvwxyz and had similar results.
How can I set this up so that I still get the right result even if I misspell half of the field? I thought that's what the fuzzy_min_sim value was for: at 0.5 it would return matches where the Levenshtein distance is under 0.5*len(string), i.e. 15 incorrect characters in this case.
Actually, the float notion of min_similarity is not fully supported anymore. We only support an edit distance (Levenshtein distance) of 1 or 2, so no matter what float you pass in, it ends up as something like Math.min(2, floatToLD(value)). Your example has way more than 2 edits.
simon
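To make the clamping concrete, here is a minimal Python sketch of the Math.min(2, floatToLD(value)) idea. The function name float_to_edits and the conversion formula (1 - similarity) * term length are assumptions for illustration, not the actual Lucene code; only the cap of 2 edits comes from the reply above.

```python
MAX_SUPPORTED_EDITS = 2  # the 2-edit cap described in the reply above


def float_to_edits(min_similarity: float, term_len: int) -> int:
    """Hypothetical stand-in for the floatToLD(value) step, followed by the cap."""
    if min_similarity >= 1.0:
        # Assumption: values of 1 or more are read directly as an edit count.
        return min(int(min_similarity), MAX_SUPPORTED_EDITS)
    # Assumption: a fractional similarity maps to (1 - similarity) * term length.
    requested = int((1.0 - min_similarity) * term_len)
    return min(requested, MAX_SUPPORTED_EDITS)


# For a 30-character term, very different similarity values land on the same cap.
for sim in (0.01, 0.5):
    print(sim, float_to_edits(sim, 30))
```

Under these assumptions, lowering fuzzy_min_sim from 0.5 to 0.01 changes nothing for a long term, which would match the behavior described in the question.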
(In reply to Jeff Dyck's post of Tuesday, August 13, 2013 3:03:21 AM UTC+2.)
Hmm, interesting, thanks Simon. If this is the case, is there another recommended way of doing fuzzy matching with greater possible distance?
I find this a bit odd, though. I seem to remember doing fuzzy matching with this same type of query not too long ago and still had good results, and I've not updated my version of ES in a while.
That was the old fuzzy query, which had very bad performance. The new fuzzy query is blazing fast but is limited to 2 edits. You could potentially write a query plugin that uses the slow old fuzzy query, but we currently don't support it. If you really need support for it, you can open an issue and we can discuss it there.
simon
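As a quick way to check whether a given misspelling falls inside the 2-edit limit, the following sketch computes a plain Levenshtein distance in Python. The helper names levenshtein and within_fuzzy_limit are made up for this example, and the sketch counts a transposition as two edits, which may differ from how the fuzzy query itself counts them.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def within_fuzzy_limit(query: str, indexed: str, max_edits: int = 2) -> bool:
    """True if the query term is within max_edits of the indexed term."""
    return levenshtein(query, indexed) <= max_edits


indexed = "abcdefghijklmnopqrstuvwxyz"
print(within_fuzzy_limit("abcdefghijklmnopqrstuvwxzz", indexed))  # one substitution
print(within_fuzzy_limit("abcdefghijklmXXXXXXXXXXXXX", indexed))  # half the term misspelled
```

A term with half its characters wrong is far outside the limit, so it cannot match regardless of the similarity value passed in.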
(In reply to Jeff Dyck's post of Tuesday, August 13, 2013 5:31:20 PM UTC+2.)
Ahh, okay. I'll work with it as it is for now and see if it still fits our needs. For the sake of performance, I'd hope that 1-2 characters would be enough anyway.
For future reference, what version of ES introduced the new fuzzy behavior?
This came with Lucene 4.0, so it's 0.90.0 that cut over to the fast FuzzyQuery. FYI, it's ~20k% faster (yes, 20k%!).
simon
(In reply to Jeff Dyck's post of Tuesday, August 13, 2013 6:11:24 PM UTC+2.)