Why does “Richard John” fail to match “Richard Johns” despite an edit distance of only 1?

Hi all,
I’m working on a name matching system using Elasticsearch with fuzziness enabled. I encountered something confusing and would really appreciate some clarification.

Problem:

  • When I input "Richard John", I expected it to match "Richard Johns" (just 1 character difference).
  • But it does not return a match.
  • However, when I input "Elvire Aide", it does match with "Elvire Ade", which is also just a 1-character difference.

Both scenarios appear to involve a 1-edit distance — so why is one working and not the other?

I’m using a match query with fuzziness and token count filtering like so:

"my_search_analyzer": {
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    "ascii_folding",
    "custom_synonym_unique",
    "stophrase_synonymphrase",
    "custom_stop",
    "custom_synonym_multi"
  ],
  "char_filter": [
    "wordbreaker_filter",
    "punctuation_filter"
  ]
}

{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name_field": {
              "query": "Richard John",
              "analyzer": "my_search_analyzer",
              "fuzziness": "AUTO:1,6",
              "prefix_length": 1,
              "minimum_should_match": "1<-50% 5<75%",
              "operator": "AND"
            }
          }
        }
      ],
      "filter": [
        {
          "term": {
            "name_token_count": 2
          }
        }
      ]
    }
  }
}

Example:

  • Indexed name: "Richard Johns"
  • Search input: "Richard John"

These differ by just 1 character (s), but no match is returned. However:

  • "Elvire Aide" correctly matches "Elvire Ade" :white_check_mark:
  • But "Elvire Adie" does not match "Elvire Ade" :cross_mark:
  • "ADIK EMPIRE" does not match "ADK EMPIRE" either :cross_mark:
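As a sanity check on the examples above, here is a minimal Python sketch that computes the edit distance for each differing token pair and compares it with what "AUTO:1,6" and "prefix_length": 1 would allow. The `auto_max_edits` helper is my reading of the documented AUTO:[low],[high] rule (an assumption, not Elasticsearch code), and it uses plain Levenshtein distance, whereas Elasticsearch also counts a transposition as one edit by default.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or keep
        prev = curr
    return prev[-1]


def auto_max_edits(term: str, low: int = 3, high: int = 6) -> int:
    """Edits allowed for a term under AUTO:low,high (assumed reading of the docs)."""
    if len(term) < low:
        return 0
    if len(term) < high:
        return 1
    return 2


# (query token, indexed token) for the token that differs in each example
PAIRS = [("john", "johns"), ("aide", "ade"), ("adie", "ade"), ("adik", "adk")]


def check_pairs(pairs=PAIRS, low=1, high=6, prefix_length=1):
    """Return (query, indexed, distance, allowed, within_limits) per pair."""
    rows = []
    for query_tok, indexed_tok in pairs:
        dist = levenshtein(query_tok, indexed_tok)
        allowed = auto_max_edits(query_tok, low, high)
        same_prefix = query_tok[:prefix_length] == indexed_tok[:prefix_length]
        rows.append((query_tok, indexed_tok, dist, allowed,
                     dist <= allowed and same_prefix))
    return rows


for row in check_pairs():
    print(row)
```

All four pairs come out at distance 1 with 1 edit allowed and a shared first character, so edit distance and prefix length alone do not explain why only one of them matches.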

My understanding:

  • Elasticsearch applies fuzziness per token.
  • Token count filtering (name_token_count == 2) ensures structural matching, but may prevent slightly longer/shorter names from appearing.

My Questions:

  1. Why does "Richard John" not match "Richard Johns", while "Elvire Aide" does match "Elvire Ade", when both differ by only one character?
  2. Does Elasticsearch only apply fuzziness within tokens, and not across?
  3. Is it because of the token count filter — and how can I allow near-miss hits like this while still filtering token count?
  4. Would using a script_score, or per-token fuzzy queries inside a bool, be a better approach for this type of multi-token fuzzy matching?
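For question 4, this is roughly what I mean by per-token fuzzy queries inside a bool, sketched in Python as a query-body builder. The function name `per_token_fuzzy_query` is hypothetical, and the whitespace split plus lowercasing is only a stand-in for my_search_analyzer (the `fuzzy` query is term-level and does not analyze its value, so real tokens would have to come from the analyzer).

```python
import json


def per_token_fuzzy_query(name, field="name_field", token_count=None,
                          fuzziness="AUTO:1,6", prefix_length=1,
                          max_expansions=50):
    """Build a bool query with one fuzzy clause per token of `name`."""
    # Assumption: split on whitespace and lowercase instead of running
    # the real search analyzer.
    tokens = name.lower().split()
    must = [
        {"fuzzy": {field: {"value": tok,
                           "fuzziness": fuzziness,
                           "prefix_length": prefix_length,
                           "max_expansions": max_expansions}}}
        for tok in tokens
    ]
    body = {"query": {"bool": {"must": must}}}
    if token_count is not None:
        # Same structural filter as in the match-query version
        body["query"]["bool"]["filter"] = [
            {"term": {"name_token_count": token_count}}
        ]
    return body


print(json.dumps(per_token_fuzzy_query("Richard John", token_count=2), indent=2))
```

Each token gets its own clause, so limits such as max_expansions can be tuned per token if needed.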

Thanks in advance for any suggestions or clarifications!

Welcome!

Could you provide a full recreation script as described in About the Elasticsearch category? It will help to better understand what you are doing. Please try to keep the example as simple as possible.

A full reproduction script is something anyone can copy and paste into the Kibana Dev Console and run to reproduce your use case. It will help readers understand, reproduce and, if needed, fix your problem. It will also most likely get you a faster answer.

Have a look at the Elastic Stack and Solutions Help · Forums and Slack | Elastic page. It also contains a lot of useful information on how to ask for help.

Hello @yap_waiyen

AFAICT, fuzziness is applied one token at a time, but Elasticsearch only expands each token to a limited number of candidate spellings: at most max_expansions (default 50) in a match query (the same limit is called fuzzy_max_expansions in query_string).
The token “john” likely has far more than 50 one-edit neighbours in a name index, so “johns” can fall outside that cut-off and never make it into the query, whereas “aide” has only a handful of neighbours, so “ade” is still included and you get the hit. In other words, the miss would be caused by the expansion limit, not by the token-count filter or any cross-token rule.

Perhaps you could raise the expansion limit (max_expansions in a match query, fuzzy_max_expansions in query_string), add a simple plural-stripping synonym, or switch to a phonetic/approx-string plugin. "Richard John" should then match "Richard Johns" too.
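To make the first suggestion concrete, here is a hedged Python sketch that rebuilds the query from the question with the expansion limit set explicitly. In a `match` query the option is spelled `max_expansions`. The value 200 is purely illustrative and the helper name `match_query` is hypothetical; I have not run this against your index.

```python
import json


def match_query(name: str, token_count: int = 2,
                max_expansions: int = 200) -> dict:
    """Query from the question, plus an explicit max_expansions."""
    return {
        "query": {
            "bool": {
                "must": [{
                    "match": {
                        "name_field": {
                            "query": name,
                            "analyzer": "my_search_analyzer",
                            "fuzziness": "AUTO:1,6",
                            "prefix_length": 1,
                            # Illustrative value; the default is 50
                            "max_expansions": max_expansions,
                            "minimum_should_match": "1<-50% 5<75%",
                            "operator": "AND",
                        }
                    }
                }],
                "filter": [{"term": {"name_token_count": token_count}}],
            }
        }
    }


print(json.dumps(match_query("Richard John"), indent=2))
```

If "Richard Johns" shows up once the limit is raised, that would point to the expansion cut-off. Larger values make each fuzzy token expand to more terms, so the query gets more expensive.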