Elasticsearch multi-match query with a fuzziness criterion behaves in a strange way

I'm trying to search for some files using the function fetch_document to fetch a document by its name in the data.name subfield.
This function is being used in the following way:


and it returns documents named 'Caroline Gardner', 'Rödl & Partner', 'Mio Partner', or even a 'Bidding Mart Partner'... I find this behaviour strange, and would like to understand why it happens.
Below is the query the function fetch_document uses to get those results.

For that, we use the following criteria:

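def query_criteria(query: str, field: str, fuzziness: str = "AUTO:6,12"):
    # The keyword-analyzed field is boosted over the tokenized subfield
    fields = [f"data.{field}^2", f"data.{field}.tokenized"]
    return dsl.Q(
        "multi_match",
        query=query,
        fields=fields,
        fuzziness=fuzziness,
        minimum_should_match="100%",
    )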

query = CompanyCluster.search(index=index)
query = query.query(
    dsl.Q(
        "bool",
        # usually this `should` field would be a list with several query_criteria
        should=[query_criteria(var_name, "name", fuzziness="AUTO:6,12")],
        # this is equal to the list length in the `should` field
        minimum_should_match=1,
    )
)
res = query.execute()
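
For reference, here is a rough sketch of the request body such a search would send, written as a plain Python dict so it can be inspected without a cluster. It assumes the criteria is a multi_match query (as in the title) and uses "Gartner" as a hypothetical search string; the helper name build_query_body is made up for this illustration and is not part of my code.

```python
def build_query_body(text: str, field: str = "name", fuzziness: str = "AUTO:6,12") -> dict:
    """Approximate request body for the search (hypothetical helper)."""
    multi_match = {
        "multi_match": {
            "query": text,
            # the ^2 boosts the keyword-analyzed field over the tokenized subfield
            "fields": [f"data.{field}^2", f"data.{field}.tokenized"],
            "fuzziness": fuzziness,
            "minimum_should_match": "100%",
        }
    }
    # A single should clause with minimum_should_match=1 means it has to match
    return {"query": {"bool": {"should": [multi_match], "minimum_should_match": 1}}}


body = build_query_body("Gartner")  # "Gartner" is an assumed example input
```

Note that both minimum_should_match settings count clauses or terms on the query side; neither one counts the words of the stored document name.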

The data.name subfield has the following mapping:

name = dsl.Text(analyzer=_tag_analyzer, fields={"raw": dsl.Keyword(), "tokenized": dsl.Text(analyzer=_tokenized_tag_analyzer)})

The tag analyzers I'm using are the following:

_tag_analyzer = dsl.analyzer(
    "tag_analyzer",
    tokenizer="keyword",  # Never split tags
    filter=["lowercase"],  # Since we already removed everything, we only need to lowercase the text
)
_tokenized_tag_analyzer = dsl.analyzer(
    "tokenized_tag_analyzer",
    # Split on whitespace and dash
    tokenizer=dsl.tokenizer("whitespace_dash", type="char_group", tokenize_on_chars=["whitespace", "-"]),
    filter=["lowercase", "stop"],
)

The special symbol filters simply remove numbers, whitespace and dashes.
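
To make the two analyzers concrete, here is a rough pure-Python approximation of the tokens each one would emit for a name. It is only a sketch: the function names are made up, it ignores the special symbol filters, and it assumes the "stop" filter uses Lucene's default English stop word list.

```python
import re

# Assumption: the built-in "stop" filter defaults to this English list
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in",
    "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the",
    "their", "then", "there", "these", "they", "this", "to", "was", "will", "with",
}


def tag_tokens(text: str) -> list[str]:
    """Mimic tag_analyzer: keyword tokenizer (one token) plus lowercase."""
    return [text.lower()]


def tokenized_tag_tokens(text: str) -> list[str]:
    """Mimic tokenized_tag_analyzer: split on whitespace and '-', lowercase, drop stop words."""
    parts = re.split(r"[\s\-]+", text)
    lowered = [p.lower() for p in parts if p]
    return [t for t in lowered if t not in ENGLISH_STOP_WORDS]


whole = tag_tokens("Caroline Gardner")
split = tokenized_tag_tokens("Caroline Gardner")
```

Under these assumptions data.name holds the single token "caroline gardner", while data.name.tokenized holds "caroline" and "gardner" as separate tokens, so a one-word query can be compared against each word on its own.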

My understanding of the fuzziness criteria fuzziness: str = f"AUTO:{x},{y}" is that tokens (in this case, words) of length at least x may differ by an edit distance of 1, and tokens of length at least y may differ by an edit distance of 2.

Given all this information, I simply don't see how minimum_should_match="100%" allows a match on just one word (gartner -> gardner is only a single character change).

You have fuzziness="AUTO:6,12". If I'm not wrong (Team Elastic, please correct me if I am):

Term length 0 to 5 = must match exactly
Term length 6 to 11 = one edit allowed
Term length 12 or more = two edits allowed

Your term has length 7, so one edit is allowed. Thus, receiving 'Caroline Gardner', 'Rödl & Partner', 'Mio Partner' and 'Bidding Mart Partner' does not seem wrong: 'gardner' and 'partner' are each one edit away from 'gartner'.
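
The reasoning above can be checked with a small pure-Python sketch. It is a simplified model, not Elasticsearch's implementation: the helper names are made up, it uses plain Levenshtein distance (so it ignores the transposition handling Elasticsearch applies by default), and it assumes the search string was the single word "gartner" and that document names are split on whitespace.

```python
def allowed_edits(term: str, low: int = 6, high: int = 12) -> int:
    """Edit budget for a term under fuzziness AUTO:low,high."""
    if len(term) < low:
        return 0
    if len(term) < high:
        return 1
    return 2


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def fuzzy_hit(query_term: str, doc_tokens: list[str]) -> bool:
    """True if any document token is within the query term's edit budget."""
    budget = allowed_edits(query_term)
    return any(edit_distance(query_term, tok) <= budget for tok in doc_tokens)


names = ["Caroline Gardner", "Rödl & Partner", "Mio Partner", "Bidding Mart Partner"]
hits = {name: fuzzy_hit("gartner", name.lower().split()) for name in names}
```

Every name ends up as a hit, because each one contains a token that is one edit away from "gartner". Since the query has only one term, minimum_should_match="100%" only requires that one term to match; it says nothing about how many words of the document name must be covered.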

Thanks :wink:
