Suggestions for person names don't work so well

Hi guys,
I know this use case is quite common, but I'm still a bit confused after reading several tutorials and docs.
In my application I'm trying to help users as they type a person's name into a form. I'd like to show suggestions/autocomplete in order to:

  1. make data entry faster
  2. reduce misspelled names

I have an index with all the common names in my country.
At first I used a fuzzy search, but I soon realized it is not what I want, because the results while typing are not what I expect.
For example, when I type "Dan", I'd like to see "Daniel" among the results too. I need an "autocomplete with fuzzy search". I don't know whether smart autocomplete for names is that simple, because there are several cases to consider: the name may be only partially typed, or it may be misspelled.
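To make the "autocomplete with fuzzy search" idea concrete, here is a minimal Python sketch. It is not how Elasticsearch implements anything; it only illustrates the behaviour being asked for. The function names `prefix_distance` and `suggest`, the sample names, and the `max_edits` threshold are all assumptions made for illustration. The idea is to measure the edit distance between the typed text and the best-matching prefix of each candidate, so that partial input and small typos are both tolerated.

```python
def prefix_distance(typed: str, candidate: str) -> int:
    """Smallest edit distance between `typed` and any prefix of `candidate`."""
    typed, candidate = typed.lower(), candidate.lower()
    # prev[j] = distance between the empty string and candidate[:j]
    prev = list(range(len(candidate) + 1))
    for i, ct in enumerate(typed, start=1):
        curr = [i]  # distance between typed[:i] and the empty prefix
        for j, cc in enumerate(candidate, start=1):
            cost = 0 if ct == cc else 1
            curr.append(min(prev[j] + 1,          # delete a typed character
                            curr[j - 1] + 1,      # insert a candidate character
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    # The last row holds the distance to every prefix; keep the best one.
    return min(prev)


def suggest(typed: str, names: list[str], max_edits: int = 1) -> list[str]:
    """Return names whose best prefix is within `max_edits`, closest first."""
    scored = sorted((prefix_distance(typed, n), n) for n in names)
    return [name for distance, name in scored if distance <= max_edits]


# Hypothetical sample data, for illustration only.
sample_names = ["Daniel", "Dante", "Marco", "Diana"]
print(suggest("Dan", sample_names))  # partial name
print(suggest("Dsn", sample_names))  # partial name with a typo
```

With this scoring, "Dan" matches "Daniel" at distance 0 because only the prefix is compared, which is the behaviour a plain fuzzy query on the whole name does not give.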

I've looked at suggesters, the fuzzy query...

If you could point me in the right direction, I'd appreciate it.

Thanks

After a few attempts I came up with something, although I'm not completely satisfied with the results.

This is my mapping:

{
  "properties": {
    "name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "name_suggest": {
      "type": "completion",
      "contexts": [
        {
          "name": "country_context",
          "type": "category",
          "path": "country.keyword"
        }
      ]
    },
    "gender": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "country": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}

An example of the data is this:
[image: Senzanome]

I want to show suggestions to the user while typing, with some fuzziness to tolerate typos.
I came up with this query, which combines the completion suggester with an exact match, because I want exact matches to score higher than suggestions.

{
  "size": 15,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name.keyword": {
              "query": "Marc",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": {
    "includes": [
      "name"
    ],
    "excludes": []
  },
  "suggest": {
    "text": "Marc",
    "complete": {
      "text": "Marc",
      "prefix": "Marc",
      "completion": {
        "field": "name_suggest",
        "size": 10,
        "fuzzy": {
          "fuzziness": 1,
          "transpositions": true,
          "min_length": 3,
          "prefix_length": 1,
          "unicode_aware": false,
          "max_determinized_states": 10000
        },
        "contexts": {
          "country_context": [
            {
              "context": "IT",
              "boost": 1,
              "prefix": false
            }
          ]
        }
      }
    }
  }
}

And these are the results:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "suggest": {
    "complete": [
      {
        "text": "Marc",
        "offset": 0,
        "length": 4,
        "options": [
          {
            "text": "Mara",
            "_index": "personname",
            "_type": "_doc",
            "_id": "2046830049",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Mara"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marat",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-1718195506",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marat"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marca",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-2041994534",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marca"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcantonio",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-16856444",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcantonio"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcella",
            "_index": "personname",
            "_type": "_doc",
            "_id": "48281663",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcella"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcelliano",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-836086954",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcelliano"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcellina",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-695286534",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcellina"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcellino",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-371432729",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcellino"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcello",
            "_index": "personname",
            "_type": "_doc",
            "_id": "372135468",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcello"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marchetto",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-1073596950",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marchetto"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          }
        ]
      }
    ]
  }
}

What I expected:

I expected the results to include at least "Marco" (an Italian name).

What I got:

I got these suggestions from ES, which are quite far from what the user wants:

[
    "Mara",
    "Marat",
    "Marca",
    "Marcantonio",
    "Marcella",
    "Marcelliano",
    "Marcellina",
    "Marcellino",
    "Marcello",
    "Marchetto"
]

I don't understand why "Marco", which is closer to the search string "Marc", is not selected.
A small detail: if I increase the size of the results from 10 to 20, I get this response, which contains "Marco":

[
    "Mara",
    "Marat",
    "Marca",
    "Marcantonio",
    "Marcella",
    "Marcelliano",
    "Marcellina",
    "Marcellino",
    "Marcello",
    "Marchetto",
    "Marchina",
    "Marchino",
    "Marchisio",
    "Marciano",
    "Marciliano",
    "Marcilio",
    "Marco",
    "Marcolina",
    "Marcolino",
    "Marcuccia"
]

But the ordering of the results is not good enough, and I don't understand how I can improve it.
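Every option in the response above comes back with the same `_score` of 4, so the order among them says little about closeness to "Marc". One possible workaround, sketched here in Python as an assumption rather than a recommended fix, is to request the larger list (size 20) and re-rank it on the client: names that start with the typed text come first, and shorter names come before longer ones. The function name `rerank` and its sort key are hypothetical; the list is the 20-name response shown above.

```python
def rerank(typed: str, suggestions: list[str], limit: int = 10) -> list[str]:
    """Re-order suggestions: exact-prefix matches first, then shortest first."""
    typed_lower = typed.lower()

    def sort_key(name: str):
        starts_with_typed = name.lower().startswith(typed_lower)
        # Tuples sort element by element: prefix flag, then length, then name.
        return (0 if starts_with_typed else 1, len(name), name.lower())

    return sorted(suggestions, key=sort_key)[:limit]


# The 20 suggestions returned for "Marc" in the response above.
es_suggestions = [
    "Mara", "Marat", "Marca", "Marcantonio", "Marcella", "Marcelliano",
    "Marcellina", "Marcellino", "Marcello", "Marchetto", "Marchina",
    "Marchino", "Marchisio", "Marciano", "Marciliano", "Marcilio",
    "Marco", "Marcolina", "Marcolino", "Marcuccia",
]

print(rerank("Marc", es_suggestions, limit=5))
```

A server-side alternative would be to index each completion entry with a `weight` (for example, one derived from how common the name is), since the completion suggester orders options by weight. That requires popularity data the post does not mention, so it is only noted here as an option.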

I'm hoping for some hints. Thanks

I suspect it will be hard to have a single request that both completes partial names and spell-checks sensibly. A spell check usually starts with the assumption that words are complete.

You can match names on the basis of

  1. How they look
  2. How they sound
  3. How they're actually used.

The problem with 1) is that simple fuzzy matching (edit distance) is not perfect. It's a short distance from "john" to "joan" and a long distance from "bob" to "robert".
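To put numbers on that, here is a small self-contained Python sketch of the classic Levenshtein edit distance (insertions, deletions and substitutions each cost 1). The function name `edit_distance` is just an illustrative choice; it is not an Elasticsearch API.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    # prev[j] = distance between the empty string and b[:j]
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


print(edit_distance("john", "joan"))   # different names, one edit apart
print(edit_distance("bob", "robert"))  # same person in practice, four edits apart
```

With a typical fuzziness of 1 or 2, "joan" would match "john" while "bob" would never match "robert".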

The problem with 2) is that while we have phonetic analyzers, they use exact-code matching only (there are no degrees of fuzziness), and they can produce many false positives and false negatives.
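As an illustration of exact-code matching, here is a minimal Python sketch of Soundex, one of the classic phonetic encodings. It is a simplified stand-alone version written for this example, not the code of any particular phonetic analyzer, and the name `soundex` is only illustrative. Two names either get the same code or they don't; there is nothing in between.

```python
# Letter groups that share a Soundex digit; vowels, y, h and w have no digit.
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {letter: digit for letters, digit in _GROUPS for letter in letters}


def soundex(name: str) -> str:
    """Return a 4-character Soundex code, or "" if the name has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = _CODES.get(letters[0], "")
    for letter in letters[1:]:
        digit = _CODES.get(letter, "")
        if digit and digit != previous:
            code += digit
        if letter not in "hw":  # h and w do not separate repeated codes
            previous = digit
    return (code + "000")[:4]   # pad or truncate to one letter plus 3 digits


for name in ("John", "Joan", "Jane", "Bob", "Robert", "Rupert"):
    print(name, soundex(name))
```

Here "John", "Joan" and "Jane" all collapse to the same code (false positives), while "Bob" and "Robert" get different codes (a false negative).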

Option 3 is the best but relies on a lot of data. If you have a load of strong IDs, e.g. customer account numbers, email addresses, etc., and for each of them a bunch of associated names (including noise like typos, misheard pronunciations, shortened names, etc.), then you can machine-learn a thesaurus like this:

This is a weighted graph that can tell you "janes" is statistically more strongly associated with "james" than with "jane" (probably because the n is next to the m on the keyboard).
Not everyone is lucky enough to have these piles of data, though, which is why firms like Basis Technology offer specialised solutions.

It's not an easy problem.

Thanks for your really interesting explanation. I see that the last solution is really complex, even though it would of course be the best.

Unfortunately I don't have such data, so I have to accept a compromise. I suppose a simpler approach would be to:

  1. Use search-as-you-type while the user is typing the name, trying to anticipate what they are typing
  2. Run a suggester like the one I'm using now once the user has typed the whole word, and check whether what they typed is in ES's suggestion list. If it is, the name is right; otherwise I can show a UI component saying "Did you mean <>?"
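The two steps above could be sketched roughly as follows in Python. This is only an outline under assumptions: `name_sayt` is a hypothetical field that would have to be mapped with the `search_as_you_type` type, and `as_you_type_query` and `did_you_mean` are made-up helper names. The first helper only builds a request body (a `multi_match` query of type `bool_prefix`, the query form usually paired with that field type); the second does the "is the typed name in the suggestion list?" check on the client.

```python
from typing import Optional


def as_you_type_query(typed: str, field: str = "name_sayt", size: int = 10) -> dict:
    """Step 1: build a search body for a search_as_you_type field (assumed mapping)."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": typed,
                "type": "bool_prefix",
                # The base field plus its auto-generated shingle subfields.
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        },
    }


def did_you_mean(typed: str, suggestions: list[str], limit: int = 3) -> Optional[list[str]]:
    """Step 2: return None if the name is known, else a few alternatives to offer."""
    known = {s.lower() for s in suggestions}
    if typed.lower() in known:
        return None  # the typed name appears in the list, nothing to correct
    return suggestions[:limit]


print(as_you_type_query("Mar"))
print(did_you_mean("Marco", ["Marca", "Marco", "Marcello"]))  # known name
print(did_you_mean("Marcc", ["Marca", "Marco", "Marcello"]))  # unknown name
```

The suggestion list passed to `did_you_mean` would come from the fuzzy completion suggester response, ideally requested with a size large enough that the exact name is not cut off.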

What do you think? Thanks very much

That makes perfect sense to me.
