Question about strange behaviour with fuzzy query in Elasticsearch

Hello everyone,

I am new to the Elastic Stack and full-text search, which is why a question came up about some fuzzy queries I have been working with over the last few weeks.

For context, I want to build a system where the index contains documents with data about persons, such as firstname, lastname etc., and the query contains free text that can mention these people at any position without a special format. I also need to tolerate typos, so I used the fuzziness option.

tl;dr: How does boosting work for fuzzy queries? I see strange behaviour when inspecting the query with the _explain endpoint.

First, here is the structure of my index.

{
  "people_large" : {
    "mappings" : {
      "properties" : {
        "city" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "firstname" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "houseNumber" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "lastname" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "plz" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        "street" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        }
      }
    }
  }
}

The query for my problem is the following.

GET people_large/_search
{
  "query": {
    "multi_match": {
      "query": "Beschwerde über Ihren SupportSehr geehrte Damen und Herren, zuallererst einmal möchte ich mich bei ihnen für das langjährige gute Dienstleister-Kunden-Verhältnis bedanken. Ich bin seit Jahren Kunde bei Ihnen und war immer zufrieden mit Ihren Leistungen. Auch wenn es einmal Probleme gab wurden diese schnell und kompetent behoben. Ein Beispiel ist, dass mir eine defekte DSL-Box sofort und ohne Zögern ersetzt wunde. In einem anderen Fall kam bei einem (selbstverursachten) Leitungsschaden sofort ein Techniker: der den Schaden behob. Von daher bin ich mit der Telekom als Dienstleister stets sehr zufrieden und werde Ihre Leistungen hoffentlich auch noch viele Jahre in Anspruch nehmen. Mein eigentliches Anliegen ist leider etwas unerfreulicherer Natur. Ich stehe Seit Mitte Januar diesen Jahres mit Hr. Muster aus Ihrem Support-Team in Kontakt. Hintergrund ist, dass ich zum Ende des letzten Jahns das Leistungspaket für Internet&Fernsehen geändert hatte von MagentaZuhause L auf MagentaZuhause M. Leider wurde mir fur Januar und nun auch für Februar 2015 weiterhin der Tarif für MagentaZuhause L berechnet. Ich habe versucht, dieses Thema mit Ihrem Support, insbesondere mit Hr. Muster, zu klären. Leider ist hier keine Einigung in Sicht. Hr. Muster beharrt darauf, dass keine Änderung desLeistungspaketes stattgefunden hat und insofern der abgebuchte Betrag korrekt ist. Mir liegen Änderungsanfrage (Anhang A) und Änderungsbestätigung seitens der Telekom (Anhang B) auf jeden Fall vor. Diese Schreiben habe ich auch Hr. Muster zugesandt, der daraufhin unterstellte, es könne sich dabei auch um Fälschungen handeln. Dieses Verhalten und den Umgang mit mir als langjährigem Kunden finde ich sehr bedauerlich. Um weitere Missverständnisse zu vermeiden, und mein Vertrauen in die Telekom auch weiterhin aufrecht zu erhalten, wurde ich den folgenden Kompromiss vorschlagen. 
Bitte weisen Sie den Fall einem anderen Support-Mitarbeiter zu, und lassen Sie die Änderung des Leistungspaketes noch einmal überprüfen. Ich bin mir sicher, dass sich mit einem Mitarbeiter mit unvoreingenommener Einstellung eine akzeptable Lösung finden lässt. Ich fände es sehr schade, wenn die bislang guten Erfahrungen nun wegen eines Missverständnisses ins Negative umschlagen. Über eine positive Rücksprache würde ich mich sehr freuen, gerne auch telefonisch unter der Telefonnummer XXXXX / XXXX XXXX Einen freundlichen freundlichen Gruß, Hanna Fuchs",
      "type": "most_fields", 
      "fields": [ "firstname", "lastname^2", "street", "houseNumber", "plz", "city" ],
      "fuzziness": "AUTO",
      "fuzzy_transpositions": "true"
    }
  }
}

At the end of the query text you can find Hanna Fuchs. A document with this name exists in the people_large index. If I execute this query, Hanna Fuchs only receives a score of ~75.6 while Mark Eich receives a score of 520.4.

If I use the _explain endpoint and look at the calculations, Mark Eich receives a boost of 48.4 and Hanna Fuchs only a boost of 4.4 for the lastname field. The idf and tf are the same for both last names in both documents, which I would expect because every document should be more or less unique: each person should appear only once in the index.

The result of _explain for Mark Eich

{
  "_index" : "people_large",
  "_type" : "_doc",
  "_id" : "XXXXXXXXXXXXXXXXXXXXXX",
  "matched" : true,
  "explanation" : {
    "value" : 520.42615,
    "description" : "sum of:",
    "details" : [
      {
        "value" : 520.42615,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 520.42615,
            "description" : "weight(lastname:eich in 17310) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 520.42615,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 48.4,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 15.913938,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 1,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 12230000,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.6756722,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 4.999789,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

The result of _explain for Hanna Fuchs

{
  "_index" : "people_large",
  "_type" : "_doc",
  "_id" : "58990879-ef33-4e72-9d98-6a8b4f350c69",
  "matched" : true,
  "explanation" : {
    "value" : 75.664696,
    "description" : "sum of:",
    "details" : [
      {
        "value" : 4.697499,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 4.697499,
            "description" : "weight(street:zum in 17431) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 4.697499,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 2.2,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 5.254508,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 63888,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 12230000,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.40636092,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 2.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.5505662,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 47.311462,
        "description" : "sum of:",
        "details" : [
          {
            "value" : 47.311462,
            "description" : "weight(lastname:fuchs in 17431) [PerFieldSimilarity], result of:",
            "details" : [
              {
                "value" : 47.311462,
                "description" : "score(freq=1.0), computed as boost * idf * tf from:",
                "details" : [
                  {
                    "value" : 4.4,
                    "description" : "boost",
                    "details" : [ ]
                  },
                  {
                    "value" : 15.913938,
                    "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                    "details" : [
                      {
                        "value" : 1,
                        "description" : "n, number of documents containing term",
                        "details" : [ ]
                      },
                      {
                        "value" : 12230000,
                        "description" : "N, total number of documents with field",
                        "details" : [ ]
                      }
                    ]
                  },
                  {
                    "value" : 0.6756722,
                    "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                    "details" : [
                      {
                        "value" : 1.0,
                        "description" : "freq, occurrences of term within document",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.2,
                        "description" : "k1, term saturation parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 0.75,
                        "description" : "b, length normalization parameter",
                        "details" : [ ]
                      },
                      {
                        "value" : 1.0,
                        "description" : "dl, length of field",
                        "details" : [ ]
                      },
                      {
                        "value" : 4.999789,
                        "description" : "avgdl, average length of field",
                        "details" : [ ]
                      }
                    ]
                  }
                ]
              }
            ]
          }
        ]
      },
      {
        "value" : 23.655731,
        "description" : "weight(firstname:hanna in 17431) [PerFieldSimilarity], result of:",
        "details" : [
          {
            "value" : 23.655731,
            "description" : "score(freq=1.0), computed as boost * idf * tf from:",
            "details" : [
              {
                "value" : 2.2,
                "description" : "boost",
                "details" : [ ]
              },
              {
                "value" : 15.913938,
                "description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details" : [
                  {
                    "value" : 1,
                    "description" : "n, number of documents containing term",
                    "details" : [ ]
                  },
                  {
                    "value" : 12230000,
                    "description" : "N, total number of documents with field",
                    "details" : [ ]
                  }
                ]
              },
              {
                "value" : 0.6756722,
                "description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
                "details" : [
                  {
                    "value" : 1.0,
                    "description" : "freq, occurrences of term within document",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.2,
                    "description" : "k1, term saturation parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 0.75,
                    "description" : "b, length normalization parameter",
                    "details" : [ ]
                  },
                  {
                    "value" : 1.0,
                    "description" : "dl, length of field",
                    "details" : [ ]
                  },
                  {
                    "value" : 4.999789,
                    "description" : "avgdl, average length of field",
                    "details" : [ ]
                  }
                ]
              }
            ]
          }
        ]
      }
    ]
  }
}

I am a bit confused why the boost for Hanna Fuchs is 4.4 and for Mark Eich 48.4. My understanding of the fuzzy query was that it matches with fuzziness, but that exact matches are valued higher than fuzzy matches.

I hope someone can help me with this problem, or point me towards further documentation on how the boosting mechanism works or a detailed explanation of fuzzy queries. I am almost certain that the docs I have read so far are not enough to grasp this problem.
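As a sanity check on the two _explain outputs above, the arithmetic can be reproduced outside Elasticsearch. This is a minimal Python sketch that only re-evaluates the formulas and numbers printed by _explain; the function names are made up for illustration. It suggests that idf and tf really are identical for lastname:eich and lastname:fuchs, so the whole score difference comes from the boost factor.

```python
import math

def bm25_idf(n, N):
    # idf formula exactly as printed in the _explain output
    return math.log(1 + (N - n + 0.5) / (n + 0.5))

def bm25_tf(freq, k1, b, dl, avgdl):
    # tf formula exactly as printed in the _explain output
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))

# Values copied from the lastname clauses of both _explain results
idf = bm25_idf(n=1, N=12_230_000)
tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=1.0, avgdl=4.999789)

# score = boost * idf * tf; only the boost differs between the two documents
score_eich = 48.4 * idf * tf
score_fuchs = 4.4 * idf * tf

print(f"idf={idf:.4f} tf={tf:.4f}")
print(f"lastname:eich  -> {score_eich:.2f}")
print(f"lastname:fuchs -> {score_fuchs:.2f}")
```

The small differences in the last digits compared to _explain are expected, since Lucene works with 32-bit floats while Python uses 64-bit floats.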

Thank you in advance.

Though it depends on the analyzer, "eich" with fuzziness AUTO (a maximum Levenshtein distance of 1 for terms of this length) could match "ich", "mich" or similar words in the query. My hypothesis is that this raises the "query coordination" score and leads to the high boost.

According to this TIP, the boost seems to be the product of all factors of the scoring function other than tf and idf.
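To illustrate the edit-distance argument, here is a small Python sketch. It assumes the documented AUTO thresholds (terms of 0 to 2 characters must match exactly, 3 to 5 characters allow one edit, longer terms allow two) and uses a plain Levenshtein distance. Elasticsearch additionally counts a transposition as a single edit when fuzzy_transpositions is enabled, which this simplified version ignores. The helper names are hypothetical.

```python
def auto_max_edits(term):
    # Documented AUTO fuzziness thresholds, based on the query term length
    n = len(term)
    if n <= 2:
        return 0
    if n <= 5:
        return 1
    return 2

def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insert, delete, substitute)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

indexed_term = "eich"
for query_term in ["ich", "mich", "dich", "fuchs"]:
    distance = levenshtein(query_term, indexed_term)
    matches = distance <= auto_max_edits(query_term)
    print(f"{query_term!r} -> {indexed_term!r}: distance={distance}, fuzzy match={matches}")
```

Under these assumptions, common German words such as "ich", "mich" and "dich" are all within one edit of "eich", while "fuchs" is not.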

Thank you so much. You are absolutely right.

The "query coordination" seems to be the cause here.

I tried the following queries

GET people_large/_search
{
  "query": {
    "multi_match": {
      "query": "ich",
      "type": "most_fields", 
      "fields": [ "firstname", "lastname^2", "street", "houseNumber", "plz", "city" ],
      "fuzziness": "AUTO",
      "fuzzy_transpositions": "true"
    }
  }
}

GET people_large/_search
{
  "query": {
    "multi_match": {
      "query": "ich mich",
      "type": "most_fields", 
      "fields": [ "firstname", "lastname^2", "street", "houseNumber", "plz", "city" ],
      "fuzziness": "AUTO",
      "fuzzy_transpositions": "true"
    }
  }
}

GET people_large/_search
{
  "query": {
    "multi_match": {
      "query": "ich mich dich",
      "type": "most_fields", 
      "fields": [ "firstname", "lastname^2", "street", "houseNumber", "plz", "city" ],
      "fuzziness": "AUTO",
      "fuzzy_transpositions": "true"
    }
  }
}

And every time the score for "eich" increased, as well as the boost value from the _explain endpoint.
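The same additive pattern may be visible in the numbers from the original _explain output. The Python sketch below only divides the boosts that were printed there. Reading the result as "number of contributing query terms" is an inference that assumes the boosts of identical expanded terms simply add up; _explain itself does not state this.

```python
import math

# Boost shown by _explain for the unboosted fields (firstname, street)
base_boost = 2.2
# Field boost from "lastname^2" in the query
field_boost = 2
# Expected boost of a single matching term on lastname (4.4, as for lastname:fuchs)
per_term_boost = base_boost * field_boost

# Boost shown by _explain for lastname:eich
observed_boost_eich = 48.4

# Assumption: each query term that expands to "eich" adds one per-term boost
implied_term_count = observed_boost_eich / per_term_boost
print(f"per-term boost: {per_term_boost:.1f}")
print(f"implied number of contributing terms: {implied_term_count:.2f}")
```

A whole-number ratio here would be consistent with several words of the long query text each fuzzy-matching "eich".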

Btw. I am using the standard analyzer.

The "query coordination" now raises the question of whether it can be turned off for fuzzy queries too, not only for bool queries.

What happens if you put the fuzzy query inside a top-level bool query with the coordination score disabled?

I tried the example provided in the documentation, but it seems that the disable_coord option has been removed from the bool query.

The query

GET /_search
{
  "query": {
    "bool": {
      "disable_coord": true,
      "should": [
        { "term": { "text": "Hanna" }},
        { "term": { "text": "Fuchs"  }}
      ]
    }
  }
}

The return

{
  "error" : {
    "root_cause" : [
      {
        "type" : "x_content_parse_exception",
        "reason" : "[4:7] [bool] unknown field [disable_coord]"
      }
    ],
    "type" : "x_content_parse_exception",
    "reason" : "[4:7] [bool] unknown field [disable_coord]"
  },
  "status" : 400
}