Suggestions for person names don't work so well

Daniele_Renda · September 9, 2020, 3:38pm

Hi guys,
I know that this use case is quite common, but I'm a bit confused after reading several tutorials and docs.
In my application I'm trying to help users as they type a person's name into a form. I'd like to show suggestions/autocomplete in order to:

make data entry faster
reduce misspelled names

I have an index with all common names in my country.
At first I used a fuzzy search, but I soon realized this is not what I want, because the results I get while typing are not what I expect.
For example, when I write "Dan", I'd also like to see "Daniel" as a result. I need an "autocomplete with fuzzy search". I don't really know whether smart autocomplete for names is that simple, because there are several cases to consider: the user may have typed a partial name, or a misspelled one.

I've looked at suggesters, fuzzy queries, and so on.

If you could point me in the right direction I'd appreciate it.

Thanks

Daniele_Renda · September 10, 2020, 7:39am

After a few tries I came up with something, even though I'm not completely satisfied with the results.

This is my mapping:

{
  "properties": {
    "name": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "name_suggest": {
      "type": "completion",
      "contexts": [
        {
          "name": "country_context",
          "type": "category",
          "path": "country.keyword"
        }
      ]
    },
    "gender": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    },
    "country": {
      "type": "text",
      "fields": {
        "keyword": {
          "type": "keyword"
        }
      }
    }
  }
}

An example of the data is this:
[screenshot "Senzanome", image not preserved]

I want to show suggestions to the user while typing, with some fuzziness to allow for typos.
I came up with this query, which uses a suggester AND an exact match, because I want an exact match to score higher than the suggestions.

{
  "size": 15,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "name.keyword": {
              "query": "Marc",
              "operator": "OR",
              "prefix_length": 0,
              "max_expansions": 50,
              "fuzzy_transpositions": true,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": true,
              "boost": 1
            }
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  },
  "_source": {
    "includes": [
      "name"
    ],
    "excludes": []
  },
  "suggest": {
    "text": "Marc",
    "complete": {
      "text": "Marc",
      "prefix": "Marc",
      "completion": {
        "field": "name_suggest",
        "size": 10,
        "fuzzy": {
          "fuzziness": 1,
          "transpositions": true,
          "min_length": 3,
          "prefix_length": 1,
          "unicode_aware": false,
          "max_determinized_states": 10000
        },
        "contexts": {
          "country_context": [
            {
              "context": "IT",
              "boost": 1,
              "prefix": false
            }
          ]
        }
      }
    }
  }
}

and these are the results:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 0,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "suggest": {
    "complete": [
      {
        "text": "Marc",
        "offset": 0,
        "length": 4,
        "options": [
          {
            "text": "Mara",
            "_index": "personname",
            "_type": "_doc",
            "_id": "2046830049",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Mara"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marat",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-1718195506",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marat"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marca",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-2041994534",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marca"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcantonio",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-16856444",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcantonio"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcella",
            "_index": "personname",
            "_type": "_doc",
            "_id": "48281663",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcella"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcelliano",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-836086954",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcelliano"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcellina",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-695286534",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcellina"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcellino",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-371432729",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcellino"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marcello",
            "_index": "personname",
            "_type": "_doc",
            "_id": "372135468",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marcello"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          },
          {
            "text": "Marchetto",
            "_index": "personname",
            "_type": "_doc",
            "_id": "-1073596950",
            "_score": 4,
            "_routing": "global",
            "_source": {
              "name": "Marchetto"
            },
            "contexts": {
              "country_context": [
                "IT"
              ]
            }
          }
        ]
      }
    ]
  }
}

What I expected:

I expected the results to include at least "Marco" (an Italian name).

What I got:

I got these suggestions from ES, which are quite far from what the user wants:

[
    "Mara",
    "Marat",
    "Marca",
    "Marcantonio",
    "Marcella",
    "Marcelliano",
    "Marcellina",
    "Marcellino",
    "Marcello",
    "Marchetto"
]

I don't get why "Marco", which is closer to the search string "Marc", is not selected.
A small detail: if I increase the size of the results from 10 to 20, I get this response, which contains "Marco":

[
    "Mara",
    "Marat",
    "Marca",
    "Marcantonio",
    "Marcella",
    "Marcelliano",
    "Marcellina",
    "Marcellino",
    "Marcello",
    "Marchetto",
    "Marchina",
    "Marchino",
    "Marchisio",
    "Marciano",
    "Marciliano",
    "Marcilio",
    "Marco",
    "Marcolina",
    "Marcolino",
    "Marcuccia"
]

but the ordering of the results is not good enough, and I don't understand how I can improve it.
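One possible workaround, sketched here in Python, is to ask the suggester for more options than are displayed and re-order them on the client. Every option in the response above carries the same `_score` of 4, so the score gives nothing to rank by. The `rerank` helper and its ordering rule (true prefix matches first, then shorter names, then alphabetical) are assumptions made for illustration, not something Elasticsearch does itself.

```python
from typing import List


def rerank(typed: str, suggestions: List[str]) -> List[str]:
    """Order suggestions so that short, exact-prefix matches come first.

    Hypothetical client-side helper: the ordering rule is an assumption,
    not Elasticsearch behaviour.
    """
    prefix = typed.casefold()

    def sort_key(name: str):
        folded = name.casefold()
        # False sorts before True, so names starting with the typed text win.
        return (not folded.startswith(prefix), len(folded), folded)

    return sorted(suggestions, key=sort_key)


# The 20 options returned for "Marc" in the post above.
names = [
    "Mara", "Marat", "Marca", "Marcantonio", "Marcella", "Marcelliano",
    "Marcellina", "Marcellino", "Marcello", "Marchetto", "Marchina",
    "Marchino", "Marchisio", "Marciano", "Marciliano", "Marcilio",
    "Marco", "Marcolina", "Marcolino", "Marcuccia",
]

top_ten = rerank("Marc", names)[:10]
print(top_ten)
```

With this rule "Marca" and "Marco" move to the top, while the fuzzy-only matches "Mara" and "Marat" drop to the end. If popularity data for names is available, it could replace name length as the tie-breaker.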

I'm hoping for some hints. Thanks

Mark_Harwood · September 10, 2020, 8:58am

I suspect it will be hard to have a single request that both completes partial names and spell-checks sensibly. A spell check usually starts with the assumption that words are complete.

You can match names on the basis of:

1. How they look
2. How they sound
3. How they're actually used

The problem with 1) is that simple fuzzy matching (edit distance) is not perfect. It's a short distance from john to joan and a long distance from bob to robert.
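To make the edit-distance point concrete, here is a small Python sketch of plain Levenshtein distance. It is a textbook implementation written for illustration, not the code Elasticsearch runs (Elasticsearch also counts a transposition of two adjacent characters as a single edit by default, which this sketch does not).

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


for typed, wanted in [("john", "joan"), ("bob", "robert"), ("dan", "daniel")]:
    print(typed, wanted, levenshtein(typed, wanted))
# john joan 1
# bob robert 4
# dan daniel 3
```

Fuzzy matching in Elasticsearch allows at most 2 edits, so "bob" can never reach "robert" this way, and a partial input such as "dan" cannot reach "daniel" through fuzziness alone. That is why partial names need prefix matching rather than a spell check.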

The problem with 2) is that while we have phonetic analyzers, they use exact-code matching only (there are no degrees of fuzziness) and they can have many false positives and false negatives.
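As an illustration of exact-code matching, the Python sketch below implements American Soundex, one of the classic phonetic encodings. It is a hand-written stand-in chosen for brevity; the phonetic analyzers available for Elasticsearch offer several encoders, and this is not their code.

```python
# Soundex digit for each consonant; vowels, h, w and y have no digit.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code of a name."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate two equal codes
        code = CODES.get(c, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeated code counts again
    return (first.upper() + "".join(digits) + "000")[:4]


for name in ["john", "joan", "jane", "james", "bob", "robert"]:
    print(name, soundex(name))
# john J500
# joan J500
# jane J500
# james J520
# bob B100
# robert R163
```

Two names either share a code or they do not: "john", "joan" and "jane" all collapse to J500 (false positives), while "bob" and "robert" get unrelated codes (a false negative), and there is no score in between.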

Option 3 is the best but relies on a lot of data. If you have a load of strong IDs, e.g. customer account numbers, email addresses etc., and for each a bunch of associated names (including noise like typos, misheard pronunciations, shortened names etc.), then you can machine-learn a thesaurus like this:

[image of the weighted graph not preserved]

This is a weighted graph that can tell you janes is statistically more strongly associated with james than jane (probably because the n is next to m on the keyboard).
Not everyone is lucky enough to have these piles of data though, which is why firms like Basis Technology offer specialised solutions.

It's not an easy problem.

Daniele_Renda · September 10, 2020, 1:54pm

Thanks for your really interesting explanation. I see that the last solution is really complex, even if it would of course be the best.

Unfortunately I don't have such data, so I have to accept a compromise. I suppose a simpler approach would be to:

Use search-as-you-type while the user is writing the name, trying to anticipate what they are writing
Run a suggester like the one I'm using now after the user has written the whole word, and check whether what the user wrote is included in ES's suggestion list. If it is, the name is right; otherwise I can show some UI component saying "Did you mean <>?"
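The two steps above could be sketched in Python roughly as follows. The request body uses the documented `multi_match` query of type `bool_prefix` against a `search_as_you_type` field and its `._2gram` / `._3gram` subfields. The field name `name_sayt`, both function names, and the "empty list means the name is fine" convention are assumptions made for this example; sending the request to Elasticsearch is left out.

```python
from typing import Dict, List


def search_as_you_type_body(typed: str, field: str = "name_sayt",
                            size: int = 10) -> Dict:
    """Build the step-1 request body for partial input.

    Assumes `field` is mapped with type "search_as_you_type"
    (hypothetical field name, not part of the mapping in this thread).
    """
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": typed,
                "type": "bool_prefix",
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        },
    }


def did_you_mean(typed: str, suggestions: List[str]) -> List[str]:
    """Step 2: return alternatives to offer, or [] if the name is known.

    `suggestions` is the list of option texts returned by the fuzzy
    completion suggester for the finished word.
    """
    wanted = typed.casefold()
    if any(option.casefold() == wanted for option in suggestions):
        return []  # the typed name is in the list, nothing to correct
    return list(suggestions)


print(search_as_you_type_body("Marc")["query"]["multi_match"]["fields"])
print(did_you_mean("Marco", ["Marca", "Marco"]))
print(did_you_mean("Marcp", ["Marca", "Marco"]))
```

Keeping the two steps as separate requests means the prefix search stays strict while the user is typing, and fuzziness is only applied once the word is complete.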

What do you think? Thanks very much

Mark_Harwood · September 10, 2020, 4:08pm

That makes perfect sense to me.

system · October 8, 2020, 4:08pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
