Fuzzy query ranks misspellings over exact for repeated "close" tokens


(Jean Helou) #1

In this issue I reported that exact matches are ranked below fuzzy matches in some edge cases, which unfortunately affect us very directly. I think this would affect a lot of catalog searches.

The problem occurs for us when searching for Porsche 911 with ES 5.x; it used to work fine in ES 1.7 with the since-removed FLT (fuzzy_like_this) query.
Porsche 911 is a long line of cars that has been produced in several variants over time:

  • 911 type 911
  • 911 type 930
  • 911 type 964
  • 911 type 991
  • 911 type 993
  • 911 type 996
  • 911 type 997

Our domain experts (the people who really know their cars) tell us that car owners and people "in the field" will search for 911 997.
This used to bring up the 911 type 997, but the latest ES returns the 911 type 991 instead.

The same problem occurs for most BMWs, since they are referred to by the rather imprecise references 320d, 318i, 316i, etc., which is what's written on the back of the trunk.

Jim Ferenczi was kind enough to offer some advice in his latest comment, but I fail to understand how it would be implemented and hope someone here can provide more insight.

Here are the specific sentences I fail to understand:

If people search for 911 997 you can maybe make the words optional but applying fuzziness to this query is problematic for scoring even when the max distance is 1.
Having multiple clauses that match the same fuzzy word is an edge case that you can counterbalance by removing problematic words.

How do I "remove the problematic words" when they convey 100% of the information?

He also states that I should not apply fuzziness to 3-letter words, which I mostly agree with. However, not all of my database is composed of 3-letter words, and I do need fuzziness in my query for the other 80% of cases.
As far as I can tell there are only 2 modes for fuzziness: an explicit max edit distance (0, 1 or 2) and AUTO, which DOES allow 1 edit to a 3-letter word.
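
To illustrate the AUTO behaviour described above, here is a small Python sketch (not Elasticsearch code). It assumes the documented AUTO thresholds (terms of 1 or 2 characters must match exactly, terms of 3 to 5 characters allow one edit, longer terms allow two) and uses plain Levenshtein distance, whereas Lucene also counts a transposition as a single edit. The helper names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def auto_max_edits(term: str) -> int:
    """Edit budget assumed for fuzziness AUTO: 0 up to 2 chars, 1 up to 5, else 2."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_candidates(term, indexed_terms, prefix_length=2):
    """Indexed terms that share the exact prefix and fit the AUTO edit budget."""
    prefix = term[:prefix_length]
    budget = auto_max_edits(term)
    return [t for t in indexed_terms
            if t.startswith(prefix) and levenshtein(term, t) <= budget]


TYPES = ["911", "930", "964", "991", "993", "996", "997"]
print(fuzzy_candidates("997", TYPES))  # ['991', '993', '996', '997']
print(fuzzy_candidates("911", TYPES))  # ['911']
```

Even with prefix_length 2, the token 997 is within one edit of four of the seven types, so the fuzzy clause on its own cannot separate 997 from 991.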

My query as it is now, with a fuzzy multi_match:

GET vehicle_fr_fr/cartype/_search?search_type=dfs_query_then_fetch
{
  "query" : {
    "bool" : {
      "should" : [
        {
          "multi_match" : {
            "query" : "911 997",
            "fields" : [
              "cartype_keywords^1.0",
              "cartype_search^1.0",
              "maker_keywords^1.0",
              "maker_search^1.0",
              "motor_keywords^1.0",
              "motor_search^1.0",
              "segment_keywords^1.0",
              "segment_search^1.0"
            ],
            "type" : "most_fields",
            "operator" : "OR",
            "slop" : 0,
            "fuzziness" : "AUTO",
            "prefix_length" : 2,
            "max_expansions" : 100,
            "lenient" : false,
            "zero_terms_query" : "NONE",
            "boost" : 1.0
          }
        },
        {"constant_score" : {
            "filter" : {
              "multi_match" : {
                "query" : "911 997",
                "fields" : [
                  "phrase^1.0"
                ],
                "type" : "phrase",
                "operator" : "OR",
                "slop" : 0,
                "prefix_length" : 0,
                "max_expansions" : 50,
                "lenient" : false,
                "zero_terms_query" : "NONE",
                "boost" : 1.0
              }
            },
            "boost" : 3.0
          }},
        {"constant_score" : {
            "filter" : {
              "multi_match" : {
                "query" : "911 997",
                "fields" : [
                  "cartype_search^1.0",
                  "maker_search^1.0",
                  "motor_search^1.0",
                  "segment_search^1.0"
                ],
                "type" : "best_fields",
                "operator" : "OR",
                "slop" : 0,
                "prefix_length" : 0,
                "max_expansions" : 50,
                "lenient" : false,
                "zero_terms_query" : "NONE",
                "boost" : 1.0
              }
            },
            "boost" : 3.0
          }},
        {"constant_score" : {
            "filter" : {
              "multi_match" : {
                "query" : "911 997",
                "fields" : [
                  "cartype_keywords^1.0",
                  "maker_keywords^1.0",
                  "motor_keywords^1.0",
                  "segment_keywords^1.0"
                ],
                "type" : "best_fields",
                "operator" : "OR",
                "slop" : 0,
                "prefix_length" : 0,
                "max_expansions" : 50,
                "lenient" : false,
                "zero_terms_query" : "NONE",
                "boost" : 1.0
              }
            },
            "boost" : 3.0
          }}
      ],
      "disable_coord" : false,
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "highlight" : {
    "pre_tags" : [
      "{"
    ],
    "post_tags" : [
      "}"
    ],
    "require_field_match" : true,
    "fields" : {
      "phrase" : {
        "type" : "fvh"
      },
      "motor_search" : {
        "type" : "fvh"
      },
      "motor_keywords" : {
        "type" : "fvh"
      },
      "cartype_search" : {
        "type" : "fvh"
      },
      "cartype_keywords" : {
        "type" : "fvh"
      },
      "segment_search" : {
        "type" : "fvh"
      },
      "segment_keywords" : {
        "type" : "fvh"
      },
      "maker_search" : {
        "type" : "fvh"
      },
      "maker_keywords" : {
        "type" : "fvh"
      }
    }
  }

}

The search fields include shingling, synonyms, character substitutions and word token splitting.
The keyword fields only include synonyms and character substitutions.
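
One possible way to keep fuzziness for longer words while matching short tokens and model references exactly is to split the user input on the client side and build one clause per token. The following is only a sketch under assumptions, not a verified fix for the ranking problem: the helper names `token_fuzziness` and `build_query`, the 4-character threshold, the "contains a digit" rule and the shortened field list are all illustrative, and it splits on whitespace rather than using the index analyzers.

```python
import json

# Shortened, illustrative field list; the real query uses more fields.
FIELDS = ["cartype_keywords", "cartype_search", "maker_keywords", "maker_search"]


def token_fuzziness(token: str, min_fuzzy_length: int = 4):
    """Return 0 (exact match) for short tokens and for tokens containing a
    digit (model references such as 997 or 320d), "AUTO" otherwise."""
    if len(token) < min_fuzzy_length or any(ch.isdigit() for ch in token):
        return 0
    return "AUTO"


def build_query(user_input: str, fields=FIELDS) -> dict:
    """Build a bool/should query with one multi_match clause per token."""
    clauses = []
    for token in user_input.split():
        clauses.append({
            "multi_match": {
                "query": token,
                "fields": fields,
                "type": "most_fields",
                "fuzziness": token_fuzziness(token),
                "prefix_length": 2,
            }
        })
    return {"query": {"bool": {"should": clauses}}}


# Both tokens of "911 997" get fuzziness 0; "porsche" would get "AUTO".
print(json.dumps(build_query("911 997"), indent=2))
```

The trade-off is that a mistyped model number (for example 998 instead of 997) would no longer be corrected, so whether this is acceptable depends on how users actually search.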

