I'm trying to search for documents using a function, fetch_document, which looks a document up by its name via the data.name subfield.
This function is being used in the following way:
fetch_document(
    client=server,
    doc_name="gartner",
)
and it returns documents named 'Caroline Gardner', 'Rödl & Partner', 'Mio Partner', or even 'Bidding Mart Partner'... I find this behaviour strange and would like to understand why it happens.
Below is the query that fetch_document uses to get those results. The match criteria are built with the following helper:
import elasticsearch_dsl as dsl  # assuming dsl is elasticsearch-dsl imported under this alias

def query_criteria(query: str, field: str, fuzziness: str = "AUTO:6,12"):
    fields = [f"data.{field}^2", f"data.{field}.tokenized"]
    return dsl.Q(
        "multi_match",
        query=query,
        fields=fields,
        type="best_fields",
        fuzziness=fuzziness,
        fuzzy_transpositions=True,
        minimum_should_match="100%",
    )
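For reference, the clause this helper builds can be inspected with to_dict() (a quick sanity-check snippet, not part of fetch_document itself; the commented output is what I'd expect, not a verified dump):
print(query_criteria("gartner", "name").to_dict())
# {'multi_match': {'query': 'gartner',
#                  'fields': ['data.name^2', 'data.name.tokenized'],
#                  'type': 'best_fields',
#                  'fuzziness': 'AUTO:6,12',
#                  'fuzzy_transpositions': True,
#                  'minimum_should_match': '100%'}}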
query = CompanyCluster.search(
    using=client,
    index=index,
).extra(size=1)
query = query.query(dsl.Q(
    "bool",
    # usually this should clause would be a list with several query_criteria calls
    should=[query_criteria(var_name, "name", fuzziness="AUTO:6,12")],
    # normally this equals the length of the should list above
    minimum_should_match=1,
))
res = query.execute()
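If it helps, the full request body can also be dumped for debugging (just a sketch; Search.to_dict() is part of elasticsearch-dsl):
import json
print(json.dumps(query.to_dict(), indent=2))
# shows the bool query with the single multi_match clause under "should",
# "minimum_should_match": 1, and "size": 1 from .extra(size=1)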
The data.name
subfield has the following mapping:
name = dsl.Text(analyzer=_tag_analyzer, fields={"raw": dsl.Keyword(), "tokenized": dsl.Text(analyzer=_tokenized_tag_analyzer)})
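For context, data.name comes from an object field on the document class, roughly like this (a simplified sketch; the real CompanyCluster has more fields, and it could equally be defined with properties on the Object field):
class Data(dsl.InnerDoc):
    # inner object holding the "data" fields
    name = dsl.Text(
        analyzer=_tag_analyzer,
        fields={"raw": dsl.Keyword(), "tokenized": dsl.Text(analyzer=_tokenized_tag_analyzer)},
    )

class CompanyCluster(dsl.Document):
    data = dsl.Object(Data)  # gives the data.name / data.name.tokenized paths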
The tag analysers I'm using are the following:
_tag_analyzer = dsl.analyzer(
    "tag_analyzer",
    tokenizer="keyword",  # never split tags
    char_filter=[_special_symbol_filter],
    filter=["lowercase"],  # symbols are already stripped, so only lowercasing is left
)

_tokenized_tag_analyzer = dsl.analyzer(
    "tokenized_tag_analyzer",
    # split on whitespace and dashes
    tokenizer=dsl.tokenizer("whitespace_dash", type="char_group", tokenize_on_chars=["whitespace", "-"]),
    char_filter=[_special_symbol_filter_2],
    filter=["lowercase", "stop"],
)
The special symbol filters simply remove numbers, whitespace, and dashes.
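To double-check what each analyser actually emits, I've been running names through the _analyze API (a sketch; it assumes the analysers are registered on the index and an elasticsearch-py 8.x client, where indices.analyze takes analyzer/text keyword arguments):
resp = client.indices.analyze(
    index=index,
    analyzer="tokenized_tag_analyzer",
    text="Caroline Gardner",
)
print([t["token"] for t in resp["tokens"]])
# prints the exact tokens the tokenized subfield indexes for this name;
# swapping in "tag_analyzer" shows what the keyword-based main field stores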
The way I'm understanding the fuzziness setting fuzziness: str = f"AUTO:{x},{y}" is that tokens (in this case words) shorter than x must match exactly, tokens of length x up to y-1 are allowed an edit distance of 1, and tokens of length y or more are allowed an edit distance of 2.
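As a plain-Python sanity check on the distances involved (levenshtein here is just an illustrative helper implementing the standard edit distance, not something from the search stack):
def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("gartner", "gardner"))  # 1
print(levenshtein("gartner", "partner"))  # 1
So individual words like 'gardner' and 'partner' are within the single edit that AUTO:6,12 allows for a 7-character term.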
Given all this information, I simply don't see how minimum_should_match="100%" allows a document to match on just one of its words ("gartner" -> "gardner" is only a single character edit away).