Hi all,
I’m working on a name matching system using Elasticsearch with fuzziness enabled. I encountered something confusing and would really appreciate some clarification.
Problem:
- When I input "Richard John", I expected it to match "Richard Johns" (just 1 character difference).
- But it does not return a match.
- However, when I input "Elvire Aide", it does match with "Elvire Ade", which is also just a 1-character difference.
Both scenarios appear to involve a 1-edit distance — so why is one working and not the other?
I’m using a match query with fuzziness and token count filtering. The search analyzer and the query look like this:
"my_search_analyzer": {
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "ascii_folding",
    "custom_synonym_unique",
    "stophrase_synonymphrase",
    "custom_stop",
    "custom_synonym_multi"
  ],
  "char_filter": [
    "wordbreaker_filter",
    "punctuation_filter"
  ]
}
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name_field": {
              "query": "Richard John",
              "analyzer": "my_search_analyzer",
              "fuzziness": "AUTO:1,6",
              "prefix_length": 1,
              "minimum_should_match": "1<-50% 5<75%",
              "operator": "AND"
            }
          }
        }
      ],
      "filter": [
        {
          "term": {
            "name_token_count": 2
          }
        }
      ]
    }
  }
}
Example:
- Indexed name: "Richard Johns"
- Search input: "Richard John"
These differ by just 1 character (the trailing "s"), but no match is returned. However:
- "Elvire Aide" correctly matches "Elvire Ade"
- But "Elvire Adie" does not match "Elvire Ade"
- "ADIK EMPIRE" does not match "ADK EMPIRE" either
My understanding:
- Elasticsearch applies fuzziness per token.
- Token count filtering (name_token_count == 2) ensures structural matching, but may prevent slightly longer/shorter names from appearing.
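To make the per-token assumption concrete, here is a minimal Python sketch of how I picture the check. It is only a model, not what Elasticsearch actually executes. It assumes that "AUTO:1,6" allows 0 edits for terms shorter than 1 character, 1 edit for terms of 1 to 5 characters, and 2 edits for 6 or more, that transpositions count as a single edit, and that prefix_length characters must match exactly. The function names are hypothetical, and the analyzer chain (synonyms, stop words, char filters) is ignored entirely.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "ab" -> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def allowed_edits(term: str, low: int = 1, high: int = 6) -> int:
    """Assumed reading of AUTO:low,high, based on the query term's length."""
    if len(term) < low:
        return 0
    if len(term) < high:
        return 1
    return 2


def token_matches(query_tok: str, indexed_tok: str, prefix_length: int = 1) -> bool:
    """One query token against one indexed token, with an exact-match prefix."""
    if query_tok[:prefix_length] != indexed_tok[:prefix_length]:
        return False
    return edit_distance(query_tok, indexed_tok) <= allowed_edits(query_tok)


def name_matches(query: str, indexed: str) -> bool:
    """Every query token must fuzzily match some indexed token (operator AND)."""
    q_tokens = query.lower().split()
    i_tokens = indexed.lower().split()
    return all(any(token_matches(q, t) for t in i_tokens) for q in q_tokens)


if __name__ == "__main__":
    pairs = [
        ("Richard John", "Richard Johns"),
        ("Elvire Aide", "Elvire Ade"),
        ("Elvire Adie", "Elvire Ade"),
        ("ADIK EMPIRE", "ADK EMPIRE"),
    ]
    for query, indexed in pairs:
        print(query, "->", indexed, name_matches(query, indexed))
```

Under this simplified model all four pairs pass, which is why I suspect that edit distance alone does not explain the difference I am seeing.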
My Questions:
- Why does "Richard John" not match "Richard Johns", while "Elvire Aide" does match "Elvire Ade", when both differ by only one character?
- Does Elasticsearch only apply fuzziness within tokens, and not across?
- Is it because of the token count filter — and how can I allow near-miss hits like this while still filtering token count?
- Would using a script_score, or per-token fuzzy queries inside a bool, be a better approach for this type of multi-token fuzzy matching?
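For the last question, this is roughly what I mean by per-token fuzzy queries inside a bool, sketched as a small Python helper that builds the request body. The helper name and its defaults are hypothetical. Since fuzzy is a term-level query and does not run the search analyzer, the sketch only lowercases and splits on whitespace, which is an assumption and not equivalent to my_search_analyzer.

```python
import json


def build_per_token_fuzzy_query(
    name: str,
    field: str = "name_field",
    token_count: int = 2,
    fuzziness: str = "AUTO:1,6",
    prefix_length: int = 1,
) -> dict:
    """Build a bool query with one fuzzy clause per whitespace-separated token."""
    # Assumption: lowercasing + whitespace split stands in for the real analyzer.
    tokens = name.lower().split()
    must_clauses = [
        {
            "fuzzy": {
                field: {
                    "value": token,
                    "fuzziness": fuzziness,
                    "prefix_length": prefix_length,
                }
            }
        }
        for token in tokens
    ]
    return {
        "query": {
            "bool": {
                "must": must_clauses,  # every token must fuzzily match
                "filter": [{"term": {"name_token_count": token_count}}],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_per_token_fuzzy_query("Richard John"), indent=2))
```

The token count filter is kept as in my current query; I have not verified whether this variant behaves differently from the match query on the examples above.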
Thanks in advance for any suggestions or clarifications!