Modify the behavior of the FuzzyQuery


(Lakomkin Egor) #1

Hi All,

I would like to ask for help, and in particular for a direction in which to start digging. I want to modify the fuzzy query behavior as follows. Say I have a set of candidate tokens for error correction, and my goal is to give more "weight" to candidates that differ from the search term only in vowels. An example:

Let's say we search for "baban".

The candidates within the allowed edit distance might be:

"koban"
"bobon" <- this should have a higher score.

I probably need to add some information to the token payload: not only the number of mismatches, but also the number of vowels and consonants changed.

In more general form:
I do not want to rely only on TF/IDF statistics in such a query, but also on some linguistic information, such as vowel/consonant substitution.
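
One way to make the idea concrete is a weighted edit distance in which swapping one vowel for another costs less than any other edit. The following is only a minimal sketch under stated assumptions: `VowelAwareDistance` is a hypothetical helper (not part of Lucene or Elasticsearch), the vowel set is English-only, and the costs 0.5 and 1.0 are illustrative rather than tuned.

```java
// Hypothetical helper for illustration; not a Lucene or Elasticsearch class.
final class VowelAwareDistance {
    // Assumption: English vowels only. Other languages need their own set.
    private static final String VOWELS = "aeiou";
    // Assumption: illustrative costs, not tuned values.
    private static final double VOWEL_SWAP_COST = 0.5;
    private static final double DEFAULT_COST = 1.0;

    private VowelAwareDistance() {}

    static boolean isVowel(char c) {
        return VOWELS.indexOf(Character.toLowerCase(c)) >= 0;
    }

    // Vowel-for-vowel substitutions are cheaper than any other substitution.
    static double substitutionCost(char a, char b) {
        if (a == b) {
            return 0.0;
        }
        return (isVowel(a) && isVowel(b)) ? VOWEL_SWAP_COST : DEFAULT_COST;
    }

    // Classic two-row Levenshtein dynamic program with a weighted substitution.
    static double distance(String source, String target) {
        double[] prev = new double[target.length() + 1];
        double[] curr = new double[target.length() + 1];
        for (int j = 0; j <= target.length(); j++) {
            prev[j] = j * DEFAULT_COST;
        }
        for (int i = 1; i <= source.length(); i++) {
            curr[0] = i * DEFAULT_COST;
            for (int j = 1; j <= target.length(); j++) {
                double substitute = prev[j - 1]
                        + substitutionCost(source.charAt(i - 1), target.charAt(j - 1));
                double delete = prev[j] + DEFAULT_COST;
                double insert = curr[j - 1] + DEFAULT_COST;
                curr[j] = Math.min(substitute, Math.min(delete, insert));
            }
            double[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[target.length()];
    }
}
```

With these assumed costs, `distance("baban", "bobon")` is 1.0 (two vowel swaps) and `distance("baban", "koban")` is 1.5 (one consonant swap plus one vowel swap), so "bobon" would rank closer.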

I am quite new to Elastic, and I wanted to ask which token filter I need to modify (if any token filter exists for fuzzy queries).

Thank you in advance for your help.


(Nik Everett) #2

You'd probably need to write a new query for this. You can probably extend
or wrap the existing fuzzy query to do the job.

I've done things like this several times. The best way to get this done is
to:

  1. Create an empty elasticsearch plugin.
  2. Add a Parser for your new fuzzy query. If you are just wrapping the
    fuzzy query, the simplest thing is probably to delegate to the fuzzy
    query parser in elasticsearch to build the fuzzy query. For now have
    your parser just return the fuzzy query from the delegate.
  3. Write a builder and some tests for your query.
  4. Fix your parser to actually wrap the fuzzy query - this is really the
    hard part.
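
The "delegate first, wrap later" idea in steps 2 and 4 can be sketched in plain Java, independent of any Elasticsearch or Lucene API. Everything here is hypothetical: `RerankingFuzzyWrapper` is an invented name, the `Function` delegate merely stands in for whatever the real fuzzy query would match, and the re-ranking rule (count positions where the change is not vowel-for-vowel) is an assumption for illustration, not how Elasticsearch scores.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Function;

// Hypothetical wrapper for illustration; not an Elasticsearch or Lucene class.
final class RerankingFuzzyWrapper {
    // Assumption: English vowels only.
    private static final String VOWELS = "aeiou";

    // Stand-in for the existing fuzzy query: maps a term to its candidate terms.
    private final Function<String, List<String>> delegate;

    RerankingFuzzyWrapper(Function<String, List<String>> delegate) {
        this.delegate = delegate;
    }

    // Step 2 analogue: return exactly what the delegate returns.
    List<String> passThrough(String term) {
        return delegate.apply(term);
    }

    // Step 4 analogue: wrap the delegate and re-rank its candidates so that
    // those with fewer non-vowel changes come first. The sort is stable, so
    // ties keep the delegate's original order.
    List<String> reranked(String term) {
        List<String> candidates = new ArrayList<>(delegate.apply(term));
        candidates.sort(Comparator.comparingInt((String c) -> consonantChanges(term, c)));
        return candidates;
    }

    // Counts positions that differ by something other than a vowel-for-vowel
    // swap. Assumption: a simple position-by-position comparison, with any
    // length difference counted as extra changes.
    static int consonantChanges(String term, String candidate) {
        int shared = Math.min(term.length(), candidate.length());
        int changes = Math.abs(term.length() - candidate.length());
        for (int i = 0; i < shared; i++) {
            char a = Character.toLowerCase(term.charAt(i));
            char b = Character.toLowerCase(candidate.charAt(i));
            boolean vowelSwap = VOWELS.indexOf(a) >= 0 && VOWELS.indexOf(b) >= 0;
            if (a != b && !vowelSwap) {
                changes++;
            }
        }
        return changes;
    }
}
```

Starting with `passThrough` keeps the plugin testable before any custom scoring exists; the real work in step 4 would be doing the equivalent of `reranked` inside the query's scoring rather than on a list of strings.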

Keep in mind that whatever you write will only work in some languages - I
suspect there are conflicts in the definition of a vowel. Also fuzzy queries
can only match up to an edit distance of 2 in their current form for some
fun reasons.

Once you've got that far you'll probably know better than I do what to do
next.

Nik

