I'm trying to search for documents using a function, fetch_document, which looks a document up by its name via the data.name subfield.
This function is being used in the following way:
fetch_document(
    client=server,
    doc_name="gartner",
)
and it returns documents named 'Caroline Gardner', 'Rödl & Partner', 'Mio Partner', or even 'Bidding Mart Partner'... I find this behaviour strange and would like to understand why it happens.
Below is the query that fetch_document uses to get those results. The match criteria are built with the following helper:
import elasticsearch_dsl as dsl  # assuming dsl is elasticsearch-dsl imported under this alias

def query_criteria(query: str, field: str, fuzziness: str = "AUTO:6,12"):
    fields = [f"data.{field}^2", f"data.{field}.tokenized"]
    return dsl.Q(
        "multi_match",
        query=query,
        fields=fields,
        type="best_fields",
        fuzziness=fuzziness,
        fuzzy_transpositions=True,
        minimum_should_match="100%",
    )
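For reference, the clause this helper builds can be inspected with to_dict() (a quick sanity-check snippet, not part of fetch_document itself; the commented output is what I'd expect, not a verified dump):
print(query_criteria("gartner", "name").to_dict())
# {'multi_match': {'query': 'gartner',
#                  'fields': ['data.name^2', 'data.name.tokenized'],
#                  'type': 'best_fields',
#                  'fuzziness': 'AUTO:6,12',
#                  'fuzzy_transpositions': True,
#                  'minimum_should_match': '100%'}}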
query = CompanyCluster.search(
    using=client,
    index=index,
).extra(size=1)
query = query.query(dsl.Q(
    "bool",
    # usually this should clause would be a list with several query_criteria calls
    should=[query_criteria(var_name, "name", fuzziness="AUTO:6,12")],
    # normally this equals the length of the should list above
    minimum_should_match=1,
))
res = query.execute()
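If it helps, the full request body can also be dumped for debugging (just a sketch; Search.to_dict() is part of elasticsearch-dsl):
import json
print(json.dumps(query.to_dict(), indent=2))
# shows the bool query with the single multi_match clause under "should",
# "minimum_should_match": 1, and "size": 1 from .extra(size=1)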
The data.name
subfield has the following mapping:
name = dsl.Text(analyzer=_tag_analyzer, fields={"raw": dsl.Keyword(), "tokenized": dsl.Text(analyzer=_tokenized_tag_analyzer)})
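For context, data.name comes from an object field on the document class, roughly like this (a simplified sketch; the real CompanyCluster has more fields, and it could equally be defined with properties on the Object field):
class Data(dsl.InnerDoc):
    # inner object holding the "data" fields
    name = dsl.Text(
        analyzer=_tag_analyzer,
        fields={"raw": dsl.Keyword(), "tokenized": dsl.Text(analyzer=_tokenized_tag_analyzer)},
    )

class CompanyCluster(dsl.Document):
    data = dsl.Object(Data)  # gives the data.name / data.name.tokenized paths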
The tag analysers I'm using are the following:
_tag_analyzer = dsl.analyzer(
    "tag_analyzer",
    tokenizer="keyword",  # never split tags
    char_filter=[_special_symbol_filter],
    filter=["lowercase"],  # symbols are already stripped, so only lowercasing is left
)

_tokenized_tag_analyzer = dsl.analyzer(
    "tokenized_tag_analyzer",
    # split on whitespace and dashes
    tokenizer=dsl.tokenizer("whitespace_dash", type="char_group", tokenize_on_chars=["whitespace", "-"]),
    char_filter=[_special_symbol_filter_2],
    filter=["lowercase", "stop"],
)
The special symbol filters simply remove numbers, whitespace, and dashes.
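To double-check what each analyser actually emits, I've been running names through the _analyze API (a sketch; it assumes the analysers are registered on the index and an elasticsearch-py 8.x client, where indices.analyze takes analyzer/text keyword arguments):
resp = client.indices.analyze(
    index=index,
    analyzer="tokenized_tag_analyzer",
    text="Caroline Gardner",
)
print([t["token"] for t in resp["tokens"]])
# prints the exact tokens the tokenized subfield indexes for this name;
# swapping in "tag_analyzer" shows what the keyword-based main field stores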
The way I'm understanding the fuzziness setting fuzziness: str = f"AUTO:{x},{y}" is that tokens (in this case words) shorter than x must match exactly, tokens of length x up to y-1 are allowed an edit distance of 1, and tokens of length y or more are allowed an edit distance of 2.
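As a plain-Python sanity check on the distances involved (levenshtein here is just an illustrative helper implementing the standard edit distance, not something from the search stack):
def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("gartner", "gardner"))  # 1
print(levenshtein("gartner", "partner"))  # 1
So individual words like 'gardner' and 'partner' are within the single edit that AUTO:6,12 allows for a 7-character term.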
Given all this information, I simply don't see how minimum_should_match="100%" allows a document to match on just one of its words ("gartner" -> "gardner" is only a single character edit away).