Question about strange behaviour with fuzzy query in Elasticsearch

psydow · February 5, 2022, 1:55pm

Hello everyone,

I am new to the Elastic Stack and to full-text search, so a question came up about some fuzzy queries I have been working with over the last few weeks.

For context: I want to build a system where the index contains documents with data about persons (firstname, lastname, etc.) and the query contains free text that can mention these people at any position, without a special format. I also need to allow for typos, which is why I used the fuzziness option.

tl;dr: How does boosting work for fuzzy queries? I see strange behaviour when inspecting a query with the _explain endpoint.

First, here is the mapping of my index.

{
  "people_large" : {
    "mappings" : {
      "properties" : {
        "city" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "firstname" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "houseNumber" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "lastname" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "plz" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "street" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

The query that shows my problem is the following.

GET people_large/_search
{
  "query": {
    "multi_match": {
      "query": "Beschwerde über Ihren SupportSehr geehrte Damen und Herren, zuallererst einmal möchte ich mich bei ihnen für das langjährige gute Dienstleister-Kunden-Verhältnis bedanken. Ich bin seit Jahren Kunde bei Ihnen und war immer zufrieden mit Ihren Leistungen. Auch wenn es einmal Probleme gab wurden diese schnell und kompetent behoben. Ein Beispiel ist, dass mir eine defekte DSL-Box sofort und ohne Zögern ersetzt wunde. In einem anderen Fall kam bei einem (selbstverursachten) Leitungsschaden sofort ein Techniker: der den Schaden behob. Von daher bin ich mit der Telekom als Dienstleister stets sehr zufrieden und werde Ihre Leistungen hoffentlich auch noch viele Jahre in Anspruch nehmen. Mein eigentliches Anliegen ist leider etwas unerfreulicherer Natur. Ich stehe Seit Mitte Januar diesen Jahres mit Hr. Muster aus Ihrem Support-Team in Kontakt. Hintergrund ist, dass ich zum Ende des letzten Jahns das Leistungspaket für Internet&Fernsehen geändert hatte von MagentaZuhause L auf MagentaZuhause M. Leider wurde mir fur Januar und nun auch für Februar 2015 weiterhin der Tarif für MagentaZuhause L berechnet. Ich habe versucht, dieses Thema mit Ihrem Support, insbesondere mit Hr. Muster, zu klären. Leider ist hier keine Einigung in Sicht. Hr. Muster beharrt darauf, dass keine Änderung desLeistungspaketes stattgefunden hat und insofern der abgebuchte Betrag korrekt ist. Mir liegen Änderungsanfrage (Anhang A) und Änderungsbestätigung seitens der Telekom (Anhang B) auf jeden Fall vor. Diese Schreiben habe ich auch Hr. Muster zugesandt, der daraufhin unterstellte, es könne sich dabei auch um Fälschungen handeln. Dieses Verhalten und den Umgang mit mir als langjährigem Kunden finde ich sehr bedauerlich. Um weitere Missverständnisse zu vermeiden, und mein Vertrauen in die Telekom auch weiterhin aufrecht zu erhalten, wurde ich den folgenden Kompromiss vorschlagen. 
Bitte weisen Sie den Fall einem anderen Support-Mitarbeiter zu, und lassen Sie die Änderung des Leistungspaketes noch einmal überprüfen. Ich bin mir sicher, dass sich mit einem Mitarbeiter mit unvoreingenommener Einstellung eine akzeptable Lösung finden lässt. Ich fände es sehr schade, wenn die bislang guten Erfahrungen nun wegen eines Missverständnisses ins Negative umschlagen. Über eine positive Rücksprache würde ich mich sehr freuen, gerne auch telefonisch unter der Telefonnummer XXXXX / XXXX XXXX Einen freundlichen freundlichen Gruß, Hanna Fuchs",
      "type": "most_fields", 
      "fields": [ "firstname", "lastname^2", "street", "houseNumber", "plz", "city" ],
      "fuzziness": "AUTO",
      "fuzzy_transpositions": "true"
    }
  }
}

At the end of the query text you can find Hanna Fuchs. A document with that name exists in the people_large index. When I execute this query, however, Hanna Fuchs only receives a score of ~75.6 while Mark Eich receives a score of 520.4.

If I now use the _explain endpoint and look at the calculations, Mark Eich receives a boost of 48.4 and Hanna Fuchs only a boost of 4.4 for the lastname field. The idf and tf are the same for both last names, which I would expect because every document should be more or less unique: each person should appear only once in the index.
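As a cross-check, the idf and tf values can be recomputed from the formulas that _explain prints in its description strings. The following is a minimal sketch in Python, not part of the original post; the function names are hypothetical, and the inputs are copied from the _explain output quoted in this post.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf as printed by _explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, k1: float, b: float, dl: float, avgdl: float) -> float:
    """tf as printed by _explain: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Inputs taken from the lastname sections of both _explain results.
lastname_idf = bm25_idf(n=1, N=12_230_000)
lastname_tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=1.0, avgdl=4.999789)

# Inputs taken from the street:zum section of the Hanna Fuchs result.
street_tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=2.0, avgdl=1.5505662)

print(f"lastname idf = {lastname_idf:.6f}")
print(f"lastname tf  = {lastname_tf:.6f}")
print(f"street tf    = {street_tf:.6f}")
```

Both lastname sections use the same inputs, so idf and tf cannot explain the difference between the two scores; only the boost factor differs.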

The result of _explain for Mark Eich

{
  "_index" : "people_large",
  "_type" : "_doc",
  "_id" : "XXXXXXXXXXXXXXXXXXXXXX",
  "matched" : true,
  "explanation" : {
    "value" : 520.42615,
    "description" : "sum of:",
    "details" : [
      {
        "value" : 520.42615,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 520.42615,
            "description" : "weight(lastname:eich in 17310) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 520.42615,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 48.4,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 15.913938,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 1,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 12230000,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.6756722,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 4.999789,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

The result of _explain for Hanna Fuchs

{
  "_index" : "people_large",
  "_type" : "_doc",
  "_id" : "58990879-ef33-4e72-9d98-6a8b4f350c69",
  "matched" : true,
  "explanation" : {
    "value" : 75.664696,
    "description" : "sum of:",
    "details" : [
      {
        "value" : 4.697499,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 4.697499,
            "description" : "weight(street:zum in 17431) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 4.697499,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 5.254508,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 63888,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 12230000,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.40636092,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.5505662,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 47.311462,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 47.311462,
            "description" : "weight(lastname:fuchs in 17431) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 47.311462,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 4.4,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 15.913938,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 1,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 12230000,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.6756722,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 4.999789,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 23.655731,
        "description" : "weight(firstname:hanna in 17431) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 23.655731,
            "description" : "score(freq=1.0), computed as boost * idf * tf from:",
            "details" : [
              {
                "value" : 2.2,
                "description" : "boost",
                "details" : [ ]
              },
              {
                "value" : 15.913938,
                "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details" : [
                  {
                    "value" : 1,
                    "description" : "n, number of documents containing term",
                    "details" : [ ]
                  },
                  {
                    "value" : 12230000,
                    "description" : "N, total number of documents with field",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.6756722,
                "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "freq, occurrences of term within document",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "k1, term saturation parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "b, length normalization parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.0,
                    "description" : "dl, length of field",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.999789,
                    "description" : "avgdl, average length of field",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

I am a bit confused why the boost is 4.4 for Hanna Fuchs and 48.4 for Mark Eich. My understanding of the fuzzy query was that it matches with fuzziness, but that exact matches are scored higher than fuzzy matches.

I hope someone can help me with this problem, or point me towards further documentation on how the boosting mechanism works or a detailed explanation of fuzzy queries. I am almost certain that the docs I have already read are not enough to understand this problem.
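The arithmetic behind the two lastname scores can be checked directly from the numbers in the _explain output. This is only a sketch with hypothetical names. Reading 4.4 as (k1 + 1) = 2.2 multiplied by the lastname^2 field boost is an assumption that fits the numbers but is not confirmed anywhere in this thread.

```python
# Values copied from the lastname sections of the two _explain results.
IDF = 15.913938
TF = 0.6756722


def explain_score(boost: float) -> float:
    """Score as described by _explain: boost * idf * tf."""
    return boost * IDF * TF


eich_score = explain_score(48.4)    # lastname:eich
fuchs_score = explain_score(4.4)    # lastname:fuchs
hanna_score = explain_score(2.2)    # firstname:hanna

# Assumption: 2.2 looks like k1 + 1 (k1 = 1.2), and lastname^2 doubles it to 4.4.
assumed_base_boost = (1.2 + 1) * 2

# How many times larger the boost for "eich" is than the boost for "fuchs".
boost_ratio = 48.4 / 4.4

print(f"eich  = {eich_score:.3f}")
print(f"fuchs = {fuchs_score:.3f}")
print(f"hanna = {hanna_score:.3f}")
print(f"boost ratio = {boost_ratio:.1f}")
```

The boost for "eich" is 11 times the boost for "fuchs", which suggests that something is accumulated once per contributing query term rather than that a single exact match is being rewarded.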

Thank you in advance.

Tomo_M · February 5, 2022, 3:21pm

Though it depends on the analyzer, eich with fuzziness AUTO (= max Levenshtein distance of 1) could match "ich", "mich" or similar words in the query. My hypothesis is that this raises the "query coordination" score and leads to the high boost value.

Judging by this TIP, the boost value seems to be the product of all terms of the scoring function other than tf and idf.
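The hypothesis can be illustrated with a small Python sketch that checks which query tokens lie within the allowed edit distance of the indexed term "eich". The function names are hypothetical. The length thresholds follow the documented default for fuzziness AUTO (0 edits up to 2 characters, 1 edit for 3 to 5, 2 edits above that). Plain Levenshtein distance is used, so transpositions, which fuzzy_transpositions would count as a single edit, are not modelled here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # delete from a
                           cur[j - 1] + 1,       # insert into a
                           prev[j - 1] + cost))  # substitute or keep
        prev = cur
    return prev[-1]


def auto_max_edits(term: str) -> int:
    """Allowed edits for fuzziness AUTO, based on the query term's length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_hits(query_tokens: list[str], indexed_term: str) -> list[str]:
    """Query tokens that would fuzzily match the indexed term under AUTO."""
    return [t for t in query_tokens
            if levenshtein(t, indexed_term) <= auto_max_edits(t)]


# Lowercased sample tokens, roughly as the standard analyzer would emit them.
tokens = ["ich", "mich", "dich", "hanna", "fuchs"]
hits = fuzzy_hits(tokens, "eich")
print(hits)
```

In a long German text, short words such as "ich" and "mich" occur many times, so many query terms can end up matching the same indexed last name.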

psydow · February 5, 2022, 5:40pm

Thank you so much. You are absolutely right.

The "query coordination" seems to be the cause here.

I tried the following queries:

GET people_large/_search
{
  "query": {
    "multi_match": {
      "query": "ich",
      "type": "most_fields", 
      "fields": [ "firstname", "lastname^2", "street", "houseNumber", "plz", "city" ],
      "fuzziness": "AUTO",
      "fuzzy_transpositions": "true"
    }
  }
}

GET people_large/_search
{
  "query": {
    "multi_match": {
      "query": "ich mich",
      "type": "most_fields", 
      "fields": [ "firstname", "lastname^2", "street", "houseNumber", "plz", "city" ],
      "fuzziness": "AUTO",
      "fuzzy_transpositions": "true"
    }
  }
}

GET people_large/_search
{
  "query": {
    "multi_match": {
      "query": "ich mich dich",
      "type": "most_fields", 
      "fields": [ "firstname", "lastname^2", "street", "houseNumber", "plz", "city" ],
      "fuzziness": "AUTO",
      "fuzzy_transpositions": "true"
    }
  }
}

Every time, the score for "eich" increased, as did the boost value reported by the _explain endpoint.

Btw. I am using the standard analyzer.

This now raises the question of whether "query coordination" can be turned off for fuzzy queries too, not only for bool queries.

Tomo_M · February 6, 2022, 1:29am

What happens if you put the fuzzy query inside a top-level bool query with the coordination score disabled?

psydow · February 7, 2022, 12:30pm

I tried the example provided in the documentation, but it seems that the disable_coord option has been removed from the bool query.

The query

GET /_search
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "term": { "text": "Hanna" }},
        { "term": { "text": "Fuchs"  }}
      ]
    }
  }
}

The return

{
  "error" : {
    "root_cause" : [
      {
        "type" : "x_content_parse_exception",
        "reason" : "[4:7] [bool] unknown field [disable_coord]"
      }
    ],
    "type" : "x_content_parse_exception",
    "reason" : "[4:7] [bool] unknown field [disable_coord]"
  },
  "status" : 400
}

system · March 7, 2022, 12:30pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
