Hello everyone,
I am new to the Elastic Stack and to full-text search, so a question came up about some fuzzy queries I have been working with over the last few weeks.
For context, I want to build a system where the index contains documents with personal data such as firstname, lastname, etc., and the query contains free text that can mention these people at any position and in no special format. I also need to tolerate typos, which is why I use the fuzziness option.
tl;dr: How does boosting work for fuzzy queries? I see strange behaviour when inspecting the query with the _explain endpoint.
First, here is the mapping of my index.
{
"people_large" : {
"mappings" : {
"properties" : {
"city" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"firstname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"houseNumber" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"lastname" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"plz" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"street" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
The query in question is the following.
GET people_large/_search
{
"query": {
"multi_match": {
"query": "Beschwerde über Ihren SupportSehr geehrte Damen und Herren, zuallererst einmal möchte ich mich bei ihnen für das langjährige gute Dienstleister-Kunden-Verhältnis bedanken. Ich bin seit Jahren Kunde bei Ihnen und war immer zufrieden mit Ihren Leistungen. Auch wenn es einmal Probleme gab wurden diese schnell und kompetent behoben. Ein Beispiel ist, dass mir eine defekte DSL-Box sofort und ohne Zögern ersetzt wunde. In einem anderen Fall kam bei einem (selbstverursachten) Leitungsschaden sofort ein Techniker: der den Schaden behob. Von daher bin ich mit der Telekom als Dienstleister stets sehr zufrieden und werde Ihre Leistungen hoffentlich auch noch viele Jahre in Anspruch nehmen. Mein eigentliches Anliegen ist leider etwas unerfreulicherer Natur. Ich stehe Seit Mitte Januar diesen Jahres mit Hr. Muster aus Ihrem Support-Team in Kontakt. Hintergrund ist, dass ich zum Ende des letzten Jahns das Leistungspaket für Internet&Fernsehen geändert hatte von MagentaZuhause L auf MagentaZuhause M. Leider wurde mir fur Januar und nun auch für Februar 2015 weiterhin der Tarif für MagentaZuhause L berechnet. Ich habe versucht, dieses Thema mit Ihrem Support, insbesondere mit Hr. Muster, zu klären. Leider ist hier keine Einigung in Sicht. Hr. Muster beharrt darauf, dass keine Änderung desLeistungspaketes stattgefunden hat und insofern der abgebuchte Betrag korrekt ist. Mir liegen Änderungsanfrage (Anhang A) und Änderungsbestätigung seitens der Telekom (Anhang B) auf jeden Fall vor. Diese Schreiben habe ich auch Hr. Muster zugesandt, der daraufhin unterstellte, es könne sich dabei auch um Fälschungen handeln. Dieses Verhalten und den Umgang mit mir als langjährigem Kunden finde ich sehr bedauerlich. Um weitere Missverständnisse zu vermeiden, und mein Vertrauen in die Telekom auch weiterhin aufrecht zu erhalten, wurde ich den folgenden Kompromiss vorschlagen. 
Bitte weisen Sie den Fall einem anderen Support-Mitarbeiter zu, und lassen Sie die Änderung des Leistungspaketes noch einmal überprüfen. Ich bin mir sicher, dass sich mit einem Mitarbeiter mit unvoreingenommener Einstellung eine akzeptable Lösung finden lässt. Ich fände es sehr schade, wenn die bislang guten Erfahrungen nun wegen eines Missverständnisses ins Negative umschlagen. Über eine positive Rücksprache würde ich mich sehr freuen, gerne auch telefonisch unter der Telefonnummer XXXXX / XXXX XXXX Einen freundlichen freundlichen Gruß, Hanna Fuchs",
"type": "most_fields",
"fields": [ "firstname", "lastname^2", "street", "houseNumber", "plz", "city" ],
"fuzziness": "AUTO",
"fuzzy_transpositions": "true"
}
}
}
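To make the role of "fuzziness": "AUTO" more concrete, here is a minimal Python sketch of how I understand the documented AUTO rule (terms of length 0 to 2 must match exactly, 3 to 5 allow one edit, longer terms allow two). It is only a simplified model and not what Elasticsearch actually runs: the function names are my own, I assume the standard analyzer has lowercased the tokens, and I ignore options such as prefix_length and max_expansions. The distance counts a swap of two adjacent characters as one edit, which is my reading of fuzzy_transpositions.

```python
def auto_max_edits(term: str) -> int:
    """Allowed edits under the documented AUTO (3,6) rule."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def osa_distance(a: str, b: str) -> int:
    """Edit distance where an adjacent transposition counts as one edit."""
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


def fuzzy_matches(query_token: str, indexed_term: str) -> bool:
    """True if the indexed term is within the AUTO budget of the query token."""
    return osa_distance(query_token, indexed_term) <= auto_max_edits(query_token)


# A few lowercased tokens taken from the query text above (hand-picked sample).
for token in ["ich", "mich", "sich", "eine", "fuchs", "hanna"]:
    print(token, fuzzy_matches(token, "eich"))
```

Under this simplified model, several very common words of the query text are within one edit of the indexed last name "eich", while "fuchs" and "hanna" only match their own terms.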
At the end of the query text you can find the name Hanna Fuchs, and a document with that name exists in the people_large index. When I execute this query, however, Hanna Fuchs only receives a score of ~75.6, while Mark Eich receives a score of 520.4.
Looking at the calculation with the _explain endpoint, Mark Eich gets a boost of 48.4 on the lastname field, while Hanna Fuchs only gets a boost of 4.4. The idf and tf are identical for both last names, which I would expect, because each document should be more or less unique: every person should appear only once in the index.
The result of _explain for Mark Eich:
{
"_index" : "people_large",
"_type" : "_doc",
"_id" : "XXXXXXXXXXXXXXXXXXXXXX",
"matched" : true,
"explanation" : {
"value" : 520.42615,
"description" : "sum of:",
"details" : [
{
"value" : 520.42615,
"description" : "sum of:",
"details" : [
{
"value" : 520.42615,
"description" : "weight(lastname:eich in 17310) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 520.42615,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 48.4,
"description" : "boost",
"details" : [ ]
},
{
"value" : 15.913938,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 12230000,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.6756722,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 4.999789,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
]
}
}
The result of _explain for Hanna Fuchs:
{
"_index" : "people_large",
"_type" : "_doc",
"_id" : "58990879-ef33-4e72-9d98-6a8b4f350c69",
"matched" : true,
"explanation" : {
"value" : 75.664696,
"description" : "sum of:",
"details" : [
{
"value" : 4.697499,
"description" : "sum of:",
"details" : [
{
"value" : 4.697499,
"description" : "weight(street:zum in 17431) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 4.697499,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 5.254508,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 63888,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 12230000,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.40636092,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 1.5505662,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
},
{
"value" : 47.311462,
"description" : "sum of:",
"details" : [
{
"value" : 47.311462,
"description" : "weight(lastname:fuchs in 17431) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 47.311462,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 4.4,
"description" : "boost",
"details" : [ ]
},
{
"value" : 15.913938,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 12230000,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.6756722,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 4.999789,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
},
{
"value" : 23.655731,
"description" : "weight(firstname:hanna in 17431) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 23.655731,
"description" : "score(freq=1.0), computed as boost * idf * tf from:",
"details" : [
{
"value" : 2.2,
"description" : "boost",
"details" : [ ]
},
{
"value" : 15.913938,
"description" : "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
"details" : [
{
"value" : 1,
"description" : "n, number of documents containing term",
"details" : [ ]
},
{
"value" : 12230000,
"description" : "N, total number of documents with field",
"details" : [ ]
}
]
},
{
"value" : 0.6756722,
"description" : "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
"details" : [
{
"value" : 1.0,
"description" : "freq, occurrences of term within document",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "k1, term saturation parameter",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "b, length normalization parameter",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "dl, length of field",
"details" : [ ]
},
{
"value" : 4.999789,
"description" : "avgdl, average length of field",
"details" : [ ]
}
]
}
]
}
]
}
]
}
}
I am confused about why the boost is 4.4 for Hanna Fuchs but 48.4 for Mark Eich. My understanding of fuzzy queries was that they match with fuzziness, but that exact matches are scored higher than fuzzy matches.
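As a sanity check, the lastname scores can be recomputed from the values in the two _explain outputs using the formulas printed in their descriptions. This is only a small Python sketch for the arithmetic; the helper names are my own. It shows that idf and tf really are identical and that the whole score difference comes from the boost, which differs by a factor of exactly 11 (48.4 / 4.4). The base value of 2.2 on the unboosted fields looks consistent with k1 + 1, and 4.4 with lastname^2, but that is my assumption and not something the output states.

```python
import math


def bm25_idf(n: int, N: int) -> float:
    """idf as printed by _explain: log(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


def bm25_tf(freq: float, k1: float, b: float, dl: float, avgdl: float) -> float:
    """tf as printed by _explain: freq / (freq + k1 * (1 - b + b * dl / avgdl))."""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


# Values copied from the lastname clauses of the two _explain outputs.
idf = bm25_idf(n=1, N=12230000)
tf = bm25_tf(freq=1.0, k1=1.2, b=0.75, dl=1.0, avgdl=4.999789)

for term, boost in [("lastname:eich", 48.4), ("lastname:fuchs", 4.4)]:
    print(term, "boost =", boost, "score =", boost * idf * tf)

# Ratio of the two boosts; idf and tf cancel out because they are identical.
print("boost ratio =", 48.4 / 4.4)
```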
I hope someone can help me with this problem, or point me to further documentation on how the boosting mechanism works or to a detailed explanation of fuzzy queries. I am fairly certain that the docs I have read so far are not enough to understand this problem.
Thank you in advance.