Fuzziness and analysis


(Martin Maletinsky) #1

Hello,

I want to search a field containing country names, which may be given in 4 different languages (English, German, French and Italian), and the search should tolerate some typing errors (which I thought of solving using fuzziness). I.e. the search strings "Italy" (English), "Italien" (German), "Italie" (French), "Italia" (Italian) and "Itali" (mistyped French version) should all match "Italy".

My first thought was to use a character filter in the index definition to normalize the country name (i.e. setting the equivalence Italy~Italien~Italie~Italia) and to use a match query with fuzziness to cope with misspellings. However, as I understand from the documentation (https://www.elastic.co/guide/en/elasticsearch/guide/current/fuzzy-match-query.html), the query string is analyzed first, and only then are the search terms fuzzified. Therefore I am afraid my approach would not work, as I try to illustrate with the following example, assuming I use the English country name as the normalized version.
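To make the setup I have in mind concrete, here is a minimal sketch of the index definition and the query, written as plain Python dictionaries that are only serialized to JSON (nothing is sent to a cluster). The index layout, the field name "country", and the names "country_names" and "country_analyzer" are made up for illustration, and the list of mappings is deliberately incomplete.

```python
import json

# Hypothetical index definition: a "mapping" character filter rewrites the
# known country-name variants to the English name before tokenization.
# The mapping filter matches literal text, so the entries are case-sensitive.
index_body = {
    "settings": {
        "analysis": {
            "char_filter": {
                "country_names": {
                    "type": "mapping",
                    "mappings": [
                        "Italien => Italy",
                        "Italie => Italy",
                        "Italia => Italy",
                        "Deutschland => Germany",
                    ],
                }
            },
            "analyzer": {
                "country_analyzer": {
                    "type": "custom",
                    "char_filter": ["country_names"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # "country" is an assumed field name.
            "country": {"type": "text", "analyzer": "country_analyzer"}
        }
    },
}

# Match query with fuzziness: the query string goes through the field's
# analyzer first, and only the resulting terms are fuzzified.
search_body = {
    "query": {
        "match": {
            "country": {"query": "Itali", "fuzziness": "AUTO"}
        }
    }
}

print(json.dumps(index_body, indent=2))
print(json.dumps(search_body, indent=2))
```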

  • The search string "Detschland" (misspelled "Deutschland", German for Germany) will not be recognized as "Deutschland" during analysis and will therefore not be transformed.
  • The Levenshtein distance between "Detschland" and the term "Germany" in the inverted index is too large for a match, although the Levenshtein distance between "Detschland" and the correct "Deutschland" would be only 1.
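The two distances above can be checked with a small, self-contained Levenshtein implementation. This is only a sketch of the textbook dynamic-programming algorithm for illustration, not the code Elasticsearch uses internally; the comparison is done on lowercased strings.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions and substitutions
    needed to turn a into b (classic row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


# One missing "u": a single edit, within the usual fuzziness limit.
print(levenshtein("detschland", "deutschland"))

# Against the normalized term in the index, the distance exceeds 2 edits.
print(levenshtein("detschland", "germany"))
```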

If there were a way to first fuzzify the search string and then run analysis on the resulting terms, "Deutschland" would appear as one of the fuzzified versions of "Detschland", would subsequently be normalized to "Germany" in the analysis step, and would therefore lead to a match.
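The order I am describing (fuzzy-match against the known name variants first, normalize afterwards) can be sketched on the application side as follows. This is not an Elasticsearch feature; the variant table, the function names and the limit of 2 edits are assumptions made for the example, and ties between equally close variants are resolved arbitrarily.

```python
# Hypothetical table: every known spelling variant -> normalized name.
VARIANTS = {
    "italy": "italy", "italien": "italy", "italie": "italy", "italia": "italy",
    "germany": "germany", "deutschland": "germany",
    "allemagne": "germany", "germania": "germany",
}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via row-by-row dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def normalize_country(user_input: str, max_edits: int = 2):
    """Step 1: find the closest known variant within max_edits.
    Step 2: map that variant to its normalized name.
    Returns None when no variant is close enough."""
    needle = user_input.strip().lower()
    best_name, best_distance = None, max_edits + 1
    for variant, normalized in VARIANTS.items():
        distance = edit_distance(needle, variant)
        if distance < best_distance:
            best_name, best_distance = normalized, distance
    return best_name


print(normalize_country("Detschland"))
print(normalize_country("Itali"))
```

The normalized name returned here could then be sent to the index as an exact term, so that no fuzziness is needed in the query itself.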

Is there a way to change the order of analysis and fuzzification, or is there another approach in Elasticsearch to solve my functional requirement?

Thank you,
with kind regards
Martin

