Fuzzy query behaviour

Hi, I'm building a phone list with Elasticsearch.
For a better user experience I want the search to be fuzzy. This is my query and the response:

localhost:9200/phonelist/users/_search
{
  "query": {
    "match": {
      "name": {
        "query": "Kuhlmann",
        "fuzziness": "AUTO"
      }
    }
  }
}


{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 4,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.47036046,
    "hits": [
      {
        "_index": "phonelist",
        "_type": "users",
        "_id": "AVVo-gmRDI1e6YuRkpxs",
        "_score": 0.47036046,
        "_source": {
          "name": "Karina Kuhlmann",
          "division": "Service Desk",
          "age": "29"
        }
      },
      {
        "_index": "phonelist",
        "_type": "users",
        "_id": "AVVo-SC2DI1e6YuRkpxq",
        "_score": 0.41156536,
        "_source": {
          "name": "Katharina Kullmann",
          "division": "Team Gehalt",
          "age": "43"
        }
      },
      {
        "_index": "phonelist",
        "_type": "users",
        "_id": "AVVo-AkvDI1e6YuRkpxp",
        "_score": 0.19178301,
        "_source": {
          "name": "Patrick Kuhlmann",
          "division": "D&A",
          "age": "22"
        }
      }
    ]
  }
}

Why doesn't the query return the two exact matches first?
I got an even more confusing result earlier, but I can't reproduce it on this PC.

Regards, Patrick
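
For reference, a minimal sketch of why "Kullmann" is a candidate for "Kuhlmann" at all. It assumes the documented AUTO thresholds (terms of 1-2 characters must match exactly, 3-5 characters allow one edit, longer terms allow two) and uses plain Levenshtein distance; Elasticsearch also counts transpositions as a single edit by default, which this sketch ignores. The function names are illustrative, not part of any Elasticsearch API.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edit distance allowed for a term under fuzziness AUTO (assumed thresholds)."""
    length = len(term)
    if length <= 2:
        return 0
    if length <= 5:
        return 1
    return 2


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


query = "kuhlmann"
allowed = auto_fuzziness(query)
for candidate in ("kuhlmann", "kullmann", "ullmann"):
    distance = levenshtein(query, candidate)
    print(candidate, distance, "matches" if distance <= allowed else "no match")
```

All three candidates fall within the allowed distance, so fuzziness decides which terms match; the order of the hits is decided by scoring.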

What version of Elasticsearch are you using? This behaviour should have been fixed a little while ago.

That was version 2.3.3.

OK - that should include the fix to a 10-year-old Lucene issue [1] that favoured rare terms over exact matches.

Can you post the results of your query with explain: true added to the query body?
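
A minimal sketch of what that request body could look like, built in Python so it can be printed and pasted. It reuses the match query from the first post and sets the top-level explain flag; the helper name with_explain is made up for illustration, and nothing is sent to a cluster here.

```python
import json


def with_explain(search_body: dict) -> dict:
    """Return a shallow copy of a search body with the top-level explain flag set."""
    body = dict(search_body)
    body["explain"] = True
    return body


# The fuzzy match query from the original post.
original = {
    "query": {
        "match": {
            "name": {
                "query": "Kuhlmann",
                "fuzziness": "AUTO",
            }
        }
    }
}

payload = json.dumps(with_explain(original), indent=2)
print(payload)
```

The printed JSON would be sent as the body of the same _search request used in the first post.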

[1] https://issues.apache.org/jira/browse/LUCENE-329

{
  "took": 162,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 11,
    "max_score": 2.6725237,
    "hits": [
      {
        "_shard": 0,
        "_node": "bkGGKQxnRXOgJamFfQtpqw",
        "_index": "users",
        "_type": "user",
        "_id": "_explain",
        "_score": 2.6725237,
        "_source": {
          "query": {
            "match": {
              "name": {
                "query": "Kuhlmann",
                "operator": "and",
                "fuzziness": "2"
              }
            }
          }
        },
        "_explanation": {
          "value": 2.6725237,
          "description": "sum of:",
          "details": [
            {
              "value": 2.6725237,
              "description": "sum of:",
              "details": [
                {
                  "value": 2.6725237,
                  "description": "weight(_all:kuhlmann in 0) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 2.6725237,
                      "description": "score(doc=0,freq=1.0), product of:",
                      "details": [
                        {
                          "value": 0.79999995,
                          "description": "queryWeight, product of:",
                          "details": [
                            {
                              "value": 6.6813097,
                              "description": "idf(docFreq=2, maxDocs=880)",
                              "details": []
                            },
                            {
                              "value": 0.119737,
                              "description": "queryNorm",
                              "details": []
                            }
                          ]
                        },
                        {
                          "value": 3.3406549,
                          "description": "fieldWeight in 0, product of:",
                          "details": [
                            {
                              "value": 1,
                              "description": "tf(freq=1.0), with freq of:",
                              "details": [
                                {
                                  "value": 1,
                                  "description": "termFreq=1.0",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 6.6813097,
                              "description": "idf(docFreq=2, maxDocs=880)",
                              "details": []
                            },
                            {
                              "value": 0.5,
                              "description": "fieldNorm(doc=0)",
                              "details": []
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },
            {
              "value": 0,
              "description": "match on required clause, product of:",
              "details": [
                {
                  "value": 0,
                  "description": "# clause",
                  "details": []
                },
                {
                  "value": 0.119737,
                  "description": "_type:user, product of:",
                  "details": [
                    {
                      "value": 1,
                      "description": "boost",
                      "details": []
                    },
                    {
                      "value": 0.119737,
                      "description": "queryNorm",
                      "details": []
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard": 3,
        "_node": "bkGGKQxnRXOgJamFfQtpqw",
        "_index": "users",
        "_type": "user",
        "_id": "kullmann",
        "_score": 1.7235754,
        "_source": {
          "name": "Katharina Ullmann"
        },
        "_explanation": {
          "value": 1.7235754,
          "description": "sum of:",
          "details": [
            {
              "value": 1.7235754,
              "description": "sum of:",
              "details": [
                {
                  "value": 1.0342314,
                  "description": "weight(_all:kullmann in 404) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 1.0342314,
                      "description": "score(doc=404,freq=3.0), product of:",
                      "details": [
                        {
                          "value": 0.77466124,
                          "description": "queryWeight, product of:",
                          "details": [
                            {
                              "value": 0.875,
                              "description": "boost",
                              "details": []
                            },
                            {
                              "value": 7.0473723,
                              "description": "idf(docFreq=1, maxDocs=846)",
                              "details": []
                            },
                            {
                              "value": 0.12562513,
                              "description": "queryNorm",
                              "details": []
                            }
                          ]
                        },
                        {
                          "value": 1.3350757,
                          "description": "fieldWeight in 404, product of:",
                          "details": [
                            {
                              "value": 1.7320508,
                              "description": "tf(freq=3.0), with freq of:",
                              "details": [
                                {
                                  "value": 3,
                                  "description": "termFreq=3.0",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 7.0473723,
                              "description": "idf(docFreq=1, maxDocs=846)",
                              "details": []
                            },
                            {
                              "value": 0.109375,
                              "description": "fieldNorm(doc=404)",
                              "details": []
                            }
                          ]
                        }
                      ]
                    }
                  ]
                },
                {
                  "value": 0.689344,
                  "description": "weight(_all:ullmann in 404) [PerFieldSimilarity], result of:",
                  "details": [
                    {
                      "value": 0.689344,
                      "description": "score(doc=404,freq=2.0), product of:",
                      "details": [
                        {
                          "value": 0.6323765,
                          "description": "queryWeight, product of:",
                          "details": [
                            {
                              "value": 0.71428573,
                              "description": "boost",
                              "details": []
                            },
                            {
                              "value": 7.0473723,
                              "description": "idf(docFreq=1, maxDocs=846)",
                              "details": []
                            },
                            {
                              "value": 0.12562513,
                              "description": "queryNorm",
                              "details": []
                            }
                          ]
                        },
                        {
                          "value": 1.0900848,
                          "description": "fieldWeight in 404, product of:",
                          "details": [
                            {
                              "value": 1.4142135,
                              "description": "tf(freq=2.0), with freq of:",
                              "details": [
                                {
                                  "value": 2,
                                  "description": "termFreq=2.0",
                                  "details": []
                                }
                              ]
                            },
                            {
                              "value": 7.0473723,
                              "description": "idf(docFreq=1, maxDocs=846)",
                              "details": []
                            },
                            {
                              "value": 0.109375,
                              "description": "fieldNorm(doc=404)",
                              "details": []
                            }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ]
            },

Sorry, I couldn't fit it all into one post, so some of the JSON may be missing at the end.
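
The numbers in the explain output can be checked by hand. The sketch below assumes Lucene's classic TF-IDF similarity (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(freq)) and assumes that a fuzzy term's boost is 1 - edits / min(query length, term length); that last formula is an assumption that happens to reproduce the 0.875 and 0.71428573 boosts above. The queryNorm and fieldNorm values are copied from the output rather than derived. Note also that the top hit has _id "_explain" and the query body as its _source, which suggests the query itself was indexed as a document at some point; this sketch only checks the arithmetic.

```python
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Classic Lucene inverse document frequency (assumed similarity)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


def tf(freq: float) -> float:
    """Classic Lucene term frequency factor."""
    return math.sqrt(freq)


def fuzzy_boost(edits: int, query_len: int, term_len: int) -> float:
    """Assumed boost for a fuzzy-expanded term; 1.0 for an exact match."""
    return 1.0 - edits / min(query_len, term_len)


def term_score(boost, doc_freq, max_docs, query_norm, freq, field_norm):
    """queryWeight * fieldWeight, following the layout of the explain output."""
    term_idf = idf(doc_freq, max_docs)
    query_weight = boost * term_idf * query_norm
    field_weight = tf(freq) * term_idf * field_norm
    return query_weight * field_weight


# Values copied from the two hits in the explain output.
exact = term_score(1.0, 2, 880, 0.119737, 1.0, 0.5)
kullmann = term_score(fuzzy_boost(1, 8, 8), 1, 846, 0.12562513, 3.0, 0.109375)
ullmann = term_score(fuzzy_boost(2, 8, 7), 1, 846, 0.12562513, 2.0, 0.109375)

print("exact kuhlmann:", round(exact, 4))
print("kullmann + ullmann:", round(kullmann + ullmann, 4))
```

Under these assumptions the results agree with the reported scores of 2.6725237 and 1.7235754 to about four decimal places. The two hits were scored with different maxDocs and queryNorm values because they come from different shards.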

I'm having a hard time piecing the JSON together.
One initial observation: you are using the default of 5 shards. If your index is small and will remain small (where "small" could still mean millions of documents), consider using a single shard instead of the default of 5.
Scoring accuracy can suffer when a very small amount of data is spread across multiple shards, because term statistics such as idf are computed per shard.
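
The shard effect is visible in the explain output posted above: one hit was scored with idf(docFreq=2, maxDocs=880) and the other with idf(docFreq=1, maxDocs=846), so each shard used its own term statistics. Below is a small sketch, assuming the classic Lucene idf formula, of how far apart those per-shard values are, together with an index settings body that asks for a single shard. The settings dict is only built and printed, not sent to a cluster.

```python
import json
import math


def idf(doc_freq: int, max_docs: int) -> float:
    """Classic Lucene inverse document frequency (assumed similarity)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Per-shard statistics copied from the explain output.
idf_shard_0 = idf(2, 880)
idf_shard_3 = idf(1, 846)
gap = idf_shard_3 - idf_shard_0

print("idf on shard 0:", round(idf_shard_0, 4))
print("idf on shard 3:", round(idf_shard_3, 4))
print("difference:", round(gap, 4))

# Settings body for creating the index with one primary shard,
# so that all documents share the same term statistics.
create_index_body = {"settings": {"number_of_shards": 1}}
print(json.dumps(create_index_body, indent=2))
```

Because idf enters the score twice (once in queryWeight and once in fieldWeight), even a small per-shard difference is amplified. If reindexing into one shard is not an option, running the search with search_type=dfs_query_then_fetch makes Elasticsearch collect term statistics across shards before scoring, at the cost of an extra round trip.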

Oh, OK. Thanks for pointing that out.

Does anyone have an idea?