Alternative to similarity (float fuzziness)

Our fuzzy queries stopped working after upgrading Elasticsearch to 2.x.
It seems that float values for fuzziness are no longer supported in Elasticsearch 2.x?

We had tuned our value to 0.82, which worked extremely well for the kind of typos we're seeing. A lower value (with AUTO being somewhere < 0.7) leads to many false positives, and since we're not sorting by relevance, we rely on good precision.

Is there any alternative, besides tokenizing in the application and splitting the search query into multiple match queries, each with a fuzziness chosen from its term's length? Something like the sketch below.
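(For reference, a rough sketch of that application-side workaround; the "products" index, "title" field, and length thresholds are placeholders we would still have to tune:)

```python
# Rough sketch of the application-side workaround: tokenize in the app and
# build one match clause per term, choosing the fuzziness from term length.
# The "products" index, "title" field, and thresholds are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch()

def fuzziness_for(term: str) -> int:
    # Placeholder thresholds; we'd tune these to approximate similarity 0.82.
    if len(term) > 8:
        return 2
    if len(term) > 4:
        return 1
    return 0

query_text = "some user qerry"
clauses = [
    {"match": {"title": {"query": term, "fuzziness": fuzziness_for(term)}}}
    for term in query_text.lower().split()
]

results = es.search(index="products", body={
    "query": {"bool": {"should": clauses}}
})
```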

The old method relied on Lucene 4.x's floatToEdits() method, which was removed in Lucene 5.x. Ultimately, floatToEdits() simply converted the similarity into an edit distance of 0, 1, or 2. The fuzzy functionality in Lucene can only do edit distances up to two for technical and performance reasons (the FST that it generates becomes too large past two edits).
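From memory, the conversion behaved roughly like this (a sketch, not the actual Lucene source):

```python
def float_to_edits(minimum_similarity: float, term_len: int) -> int:
    """Rough approximation of Lucene 4.x's FuzzyQuery.floatToEdits()."""
    if minimum_similarity >= 1.0:
        # Values >= 1 were already treated as a raw edit distance.
        return min(int(minimum_similarity), 2)
    # Fractional similarity: longer terms tolerate more edits...
    edits = int((1.0 - minimum_similarity) * term_len)
    # ...capped at 2, since that's the most the fuzzy automaton supports.
    return min(edits, 2)

# With similarity 0.82: 1-5 character terms -> 0 edits,
# 6-11 characters -> 1 edit, 12+ characters -> 2 edits.
```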

The cutoffs seem to be 6 and 12 characters for your fuzziness of 0.82 (converting to 1 and 2 edits, respectively). I'm not sure how big your tokens are, however.

If you think you really need this length-based fuzziness, you could probably use two multi-fields which are explicitly gated by token length using the length token filter. Big tokens go in one, small tokens go in the other, with the cutoff determined by how you want the edit distances to be applied. Then you query both fields, each with the appropriate fuzziness.
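A sketch of what that could look like (the "products" index, "title" field, analyzer names, and the 12-character cutoff are all illustrative; adjust the cutoff, or add a third bucket, to match how you want the edits applied):

```python
# Two length-gated sub-fields plus a query that hits both with different
# fuzziness. All names and the cutoff are illustrative.
from elasticsearch import Elasticsearch

es = Elasticsearch()

es.indices.create(index="products", body={
    "settings": {
        "analysis": {
            "filter": {
                "short_tokens": {"type": "length", "min": 1, "max": 11},
                "long_tokens": {"type": "length", "min": 12}
            },
            "analyzer": {
                "short_only": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "short_tokens"]
                },
                "long_only": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "long_tokens"]
                }
            }
        }
    },
    "mappings": {
        "product": {
            "properties": {
                "title": {
                    "type": "string",
                    "fields": {
                        "short": {"type": "string", "analyzer": "short_only"},
                        "long": {"type": "string", "analyzer": "long_only"}
                    }
                }
            }
        }
    }
})

# Because each sub-field's analyzer also runs at query time, the length
# filter drops long tokens from the title.short clause and short tokens
# from the title.long clause, so every token gets exactly one fuzziness.
results = es.search(index="products", body={
    "query": {
        "bool": {
            "should": [
                {"match": {"title.short": {"query": "sleveless dress", "fuzziness": 1}}},
                {"match": {"title.long": {"query": "sleveless dress", "fuzziness": 2}}}
            ]
        }
    }
})
```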

In general though, I think trying to fight fuzziness is a losing battle precisely because it is so fragile. It just takes a small shift in the dataset for the fuzzy matches to stop returning good results. Or, in this case, a small change to the query behavior. If it's an option for you, suggesters are a much cleaner way to deal with typos imo.

That is an excellent idea; I think I'll use it to get the feature working again.

But I'm curious about the suggesters. My case is a product search, and people who type in "dress cheries" should see dresses with cherries on them. Is your suggestion to run the search query through a term suggester first, take the top suggestions, add them to the search query, and then run the actual search? Something like the sketch below?
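Roughly like this, I mean (the "products" index and "title" field are just placeholders):

```python
# Two-step flow: ask a term suggester for per-term corrections, then add
# the top suggestions to the real search. Names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch()

query_text = "dress cheries"

suggest_resp = es.search(index="products", body={
    "size": 0,  # we only want the suggestions from this request
    "suggest": {
        "spelling": {
            "text": query_text,
            "term": {"field": "title"}
        }
    }
})

# Best suggestion (if any) for each misspelled term.
corrections = [
    entry["options"][0]["text"]
    for entry in suggest_resp["suggest"]["spelling"]
    if entry["options"]
]

expanded_query = " ".join([query_text] + corrections)

results = es.search(index="products", body={
    "query": {"match": {"title": expanded_query}}
})
```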

That's definitely one way you could use them. There are many possible avenues you could explore, depending on requirements.

Generally I'd suggest a multi-tiered suggestion route if you have the time/energy to set up the system:

  • Completion suggester to provide as-you-type autocompletion. This often preempts typos, since people are more likely to just pick the suggestion instead of finishing typing.
  • Phrase suggester to provide "Did you mean?"-style phrases at the top/bottom of the search results, like how Google displays them (see the sketch after this list).
  • Sometimes a term suggester, to enrich the search results with a secondary search when there are zero (or low-scoring) results. Not ideal, however, since it requires a second search.
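For example, a minimal "Did you mean?" request with the phrase suggester could look like this (index/field names and parameters are illustrative; in practice you'd usually point it at a shingled sub-field):

```python
# Minimal "Did you mean?" sketch: run the normal query and ask the phrase
# suggester for a corrected phrase in the same request. Names and
# parameters are illustrative only.
from elasticsearch import Elasticsearch

es = Elasticsearch()

resp = es.search(index="products", body={
    "query": {"match": {"title": "dress cheries"}},
    "suggest": {
        "did_you_mean": {
            "text": "dress cheries",
            "phrase": {
                "field": "title",
                "size": 1,
                "direct_generator": [
                    {"field": "title", "suggest_mode": "always"}
                ]
            }
        }
    }
})

for entry in resp["suggest"]["did_you_mean"]:
    for option in entry["options"]:
        print("Did you mean:", option["text"])
```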

What's nice about suggesters is that you can often give the user the "correct" suggestion outside of the query itself, which lets the query focus on more precise matches like exact, phrase, multi-field, etc. It's easier to tune since you aren't fighting the precision/recall tradeoff, and if there aren't sufficiently good results the user can fall back to the "Did you mean?" suggestions and try again.

It does take a bit more work to set up (particularly the completion suggester), but it often delivers a more pleasant user experience and makes your life as a dev easier in the long run :slight_smile: