How can I influence the "directionality" of a trigram match?


We use Elasticsearch to search address data. To support non-exact matches, we include a variant of the street name field that is analyzed with an ngram tokenizer (trigrams, to be specific). Queries on this field use a minimum_should_match clause of "3<75%", which means: if the search term has 3 or fewer trigrams, all of them have to match; if it has more than 3, then 75% of them have to match.
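To make that rule concrete, here is a minimal Python sketch of how I understand "3<75%" to translate into a required number of matching trigrams. The function name `required_matches` is made up for illustration, and the sketch assumes the percentage is rounded down to a whole number of trigrams.

```python
def required_matches(num_trigrams: int, threshold: int = 3, percent: int = 75) -> int:
    """Trigrams that must match under a "3<75%"-style rule (illustrative only)."""
    if num_trigrams <= threshold:
        # At or below the threshold, every trigram is required.
        return num_trigrams
    # Above the threshold, the percentage applies (assumed to round down).
    return (num_trigrams * percent) // 100


print(required_matches(3))   # 3
print(required_matches(4))   # 3
print(required_matches(13))  # 9
```

So a 4-trigram search term needs 3 matches, while a 13-trigram search term needs 9.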

Generally this works OK, but there are cases where we get unintended results like this:

We search for "Uhland" and we find "Am Maschlandgraben". As far as I can tell, "Uhland" is split into "uhl", "hla", "lan" and "and", and 3 of those 4 trigrams can be matched to trigrams of "Am MascHLANDgraben" (the matching part is in upper case). 3 out of 4 is 75%, which fulfills our "3<75%" requirement, so it becomes a match.

So there is a "directionality" (for lack of a better word) to that 75% match. It only counts against the number of trigrams in the search term and ignores how many trigrams of the indexed document are not matched.

One could argue that the 75% requirement is not met in that example, because 10 of the 13 trigrams from "Am Maschlandgraben" are not matched by the trigrams of "Uhland". And in fact, if you reverse the query and search for "Am Maschlandgraben", you won't find "Uhland" as a match: now the "directionality" is reversed, only 3 of the 13 query trigrams are matched, and that does not meet the "3<75%" requirement.
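The counts in both directions can be reproduced with a small Python sketch. It only approximates our analyzer: it assumes lowercasing, splitting on whitespace, a fixed gram size of 3, and that words shorter than 3 letters (like "Am") produce no trigrams. The helper names `trigrams` and `matched_of_query` are made up for illustration.

```python
def trigrams(text: str) -> set[str]:
    """Distinct lowercased character trigrams of each whitespace-separated word."""
    grams = set()
    for word in text.lower().split():
        # Words shorter than 3 characters yield an empty range, so no trigrams.
        grams.update(word[i:i + 3] for i in range(len(word) - 2))
    return grams


def matched_of_query(query: str, indexed: str) -> tuple[int, int]:
    """Return (query trigrams found in the indexed text, total query trigrams)."""
    q, d = trigrams(query), trigrams(indexed)
    return len(q & d), len(q)


print(matched_of_query("Uhland", "Am Maschlandgraben"))  # (3, 4)
print(matched_of_query("Am Maschlandgraben", "Uhland"))  # (3, 13)
```

The same 3 shared trigrams are 75% of one side but only about 23% of the other, which is exactly the asymmetry described above.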

What I would love to figure out is how I can modify the query so that the 75% match has no "directionality" and always has to hold on both sides of the comparison. To stay with the example above, I want neither "Uhland" to match "Am Maschlandgraben" nor "Am Maschlandgraben" to match "Uhland".

So, to put it in plain language: instead of "75% of the search term trigrams need to match the indexed document", I would like to have "75% of both the search term trigrams and the indexed document trigrams need to match".
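To pin down the behavior I am after, here is a Python sketch of the symmetric check. This is not an Elasticsearch query, only an illustration of the desired semantics under the same assumptions as before (lowercasing, whitespace splitting, gram size 3, percentage rounded down). The names `trigrams`, `needed` and `symmetric_match` are hypothetical.

```python
def trigrams(text: str) -> set[str]:
    """Distinct lowercased character trigrams of each whitespace-separated word."""
    grams = set()
    for word in text.lower().split():
        grams.update(word[i:i + 3] for i in range(len(word) - 2))
    return grams


def needed(count: int, threshold: int = 3, percent: int = 75) -> int:
    """Required matches for a "3<75%"-style rule (percentage assumed to round down)."""
    return count if count <= threshold else (count * percent) // 100


def symmetric_match(a: str, b: str) -> bool:
    """True only if the shared trigrams satisfy the rule for BOTH strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return False  # nothing to compare
    shared = len(ta & tb)
    return shared >= needed(len(ta)) and shared >= needed(len(tb))


print(symmetric_match("Uhland", "Am Maschlandgraben"))  # False
print(symmetric_match("Am Maschlandgraben", "Uhland"))  # False
print(symmetric_match("Uhland", "Uhlant"))              # True
```

With this rule the result no longer depends on which string is the search term, while a small typo such as "Uhlant" still matches.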

I hope I communicated my intention well enough (English is not my native language).

Here is an example of how our query looks now:

  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "address.street.trigram": {
              "query": "Uhland",
              "minimum_should_match": "3<75%"
            }
          }
        }
      ]
    }
  }

Thanks in advance!

Mario K.
