Fuzzy Search not working as intended

I changed the default fuzziness logic of ES from Fuzziness.AUTO to my own, based on term length, which is as follows -

0-3 - no edits allowed or Fuzziness.ZERO
4-10 - 1 edit allowed or Fuzziness.ONE
10+ - 2 edits allowed or Fuzziness.TWO
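
As a rough sketch, the rule above could be expressed like this. The class and method names are hypothetical, and because the ranges overlap at length 10, treating a 10-character term as "1 edit" is an assumption.

```java
// Hypothetical helper: maps a term's length to the maximum number of edits.
final class CustomFuzziness {
    private CustomFuzziness() {}

    /** Returns the allowed edit distance (0, 1 or 2) for the given term. */
    static int maxEdits(String term) {
        int length = term.length();
        if (length <= 3) {
            return 0; // 0-3 characters: exact match only (Fuzziness.ZERO)
        }
        if (length <= 10) {
            return 1; // 4-10 characters: one edit (Fuzziness.ONE); 10 lands here by assumption
        }
        return 2;     // longer terms: two edits (Fuzziness.TWO)
    }
}
```

The returned value would then be mapped to Fuzziness.ZERO, Fuzziness.ONE or Fuzziness.TWO before being passed to the query builder.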

Earlier the way fuzzy was running was -

MultiMatchQueryBuilder queryBuilder2 = QueryBuilders.multiMatchQuery(QueryParser.escape(suggestion), FULL_SEARCH_FIELDS).fuzziness(Fuzziness.AUTO).lenient(true).boost(0.1f);

After this change I also added different boosts to each field required as follows -

QueryStringQueryBuilder queryBuilder2 = QueryBuilders.queryStringQuery(QueryParser.escape(suggestion)).analyzer("hsproduct").fuzziness(fuzziness).lenient(true);
for (ProductFuzzySearchKeywordEnum field : ProductFuzzySearchKeywordEnum.values()) {
    queryBuilder2.field(field.getFieldName(), field.getFieldBoost());
}

After the above changes, the fuzzy query no longer matches some basic cases, such as "SCHOOL" for the search "SCOOL" and "FOOTWEAR" for "FOOTWEAT".

I am currently on ES 5.1 and would appreciate it if anyone can spot something wrong with what I have done.

What is your hsproduct analyzer doing? Can you paste the configuration for it?

Try indexing the field with an nGram analyzer. It generally yields better results than fuzzy matching. Set min_gram to the smallest length of your strings and max_gram to the largest.

The hsproduct analyzer is as follows -

{
  "settings": {
    "analysis": {
      "analyzer": {
        "hsproduct": {
          "type": "custom",
          "filter": ["standard", "lowercase", "stopword", "synonym", "stemmer"],
          "tokenizer": "standard",
          "char_filter": ["input_char_filter"]
        }
      },
      "filter": {
        "stopword": {
          "type": "stop",
          "ignore_case": "true"
        },
        "stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "synonym": {
          "type": "synonym",
          "synonyms": ["hi,Hi"]
        }
      },
      "char_filter": {
        "input_char_filter": {
          "type": "pattern_replace",
          "pattern": "[^A-z0-9 ]+",
          "replacement": ""
        }
      }
    }
  }
}

But doesn't fuzzy search match any and all strings within a given Damerau-Levenshtein distance? For example, it should match SCOOL with SCHOOL if the edit distance is 1, and not match COOL with SCHOOL, as that requires an edit distance of 2. Or is there a limitation of fuzzy search in ES that I am not aware of? Please let me know if there is one.
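
To make the distances in this example concrete, here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance. It is only an illustration of the metric; it is not how Lucene or Elasticsearch implement fuzzy matching internally, and the class name is hypothetical.

```java
// Hypothetical illustration of edit distance with adjacent transpositions.
final class EditDistance {
    private EditDistance() {}

    /** Counts insertions, deletions, substitutions and adjacent transpositions. */
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all i characters of a
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all j characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1); // deletion, insertion
                best = Math.min(best, d[i - 1][j - 1] + cost);         // substitution or match
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);        // adjacent transposition
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }
}
```

Under this metric, "scool" and "school" are 1 edit apart (one insertion), "footweat" and "footwear" are 1 edit apart (one substitution), and "cool" and "school" are 2 edits apart.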

Also, as far as I know, ngram tokenizers are more useful for languages that have very long compound words, as per the official Elasticsearch site -

"The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.

N-grams are like a sliding window that moves across the word - a continuous sequence of characters of the specified length. They are useful for querying languages that don’t use spaces or that have long compound words, like German.", taken from https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html

Also, in my case min_gram would be 1-2 and max_gram would go up to 10-15. This would produce a lot of tokens for each indexed field. What are your views on this?
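
To give a feel for the token count, here is a small sketch that enumerates character n-grams of a single word. It only imitates the sliding-window idea from the quoted documentation; the class name is hypothetical and this is not the Elasticsearch tokenizer itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of how many n-grams a single word produces.
final class NGrams {
    private NGrams() {}

    /** Returns every substring of word whose length is between minGram and maxGram. */
    static List<String> ngrams(String word, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < word.length(); start++) {
            for (int len = minGram; len <= maxGram && start + len <= word.length(); len++) {
                grams.add(word.substring(start, start + len));
            }
        }
        return grams;
    }
}
```

For the 8-character word "footwear", min_gram 2 and max_gram 10 give 28 n-grams, and min_gram 1 with max_gram 15 gives 36, all from a single word in a single field.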

You are assuming that fuzzy would replace the characters you are expecting but the offset could be anywhere.

OK, can you explain what you mean by this "offset" and how it affects fuzzy searches? I didn't know such a thing existed. Also, I have edited my earlier reply; can you have a look at it?

Offset means the positions of characters. Fuzzy allows for transposition, deletion, insertion and substitution. For example,

'abc' with fuzziness one will also return 'acb' as a valid result. In your example,

"SCHOOL" with search "SCOOL" can match SDOOL, SCOOL, SCHOOD etc. nGrams may not help here; regex queries will help more. https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl-regexp-query.html

"FOOTWEAR" with "FOOTWEAT": nGrams will match here.

I also noticed you are setting a boost of 0.1f, which is punishing for matching. If you really want to boost, use a value greater than 1; a value less than 1 lowers the score.

nGrams will produce a lot of tokens and take up quite a bit of disk space too, but your queries will run faster than wildcard or fuzzy queries.

OK, I get why ngrams would be faster, but I am sorry, I still don't get why SCOOL will not match SCHOOL, as insertion of an H should be allowed in fuzzy. Also, since my highest-relevance fields have a boost of 0.9f, I have adjusted the boosts of the other fields accordingly so that they don't get more relevance than that one. And why is it punishing to use a boost of 0.1f if all my other fields use boosts in a similar range?

If you want to boost a certain result, use a boost greater than 1; if you want to lower the relevancy, use a boost between 0 and 1. Please read the documentation regarding this.

I tried reproducing your issue with HTTP requests, but I could not reproduce it.

The following correctly matches the "school" document when searching for "scool":

DELETE i

PUT /i
{
  "settings":
  {
    "analysis": {
      "analyzer": {
        "hsproduct": {
          "type": "custom",
          "filter": ["standard", "lowercase", "stopword", "synonym", "stemmer"],
          "tokenizer": "standard",
          "char_filter": ["input_char_filter"]
        }
      },
      "filter": {
        "stopword": {
          "type": "stop",
          "ignore_case": "true"
        },
        "stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "synonym": {
          "type": "synonym",
          "synonyms":["hi,Hi"]
        }
      },
      "char_filter": {
        "input_char_filter": {
          "type": "pattern_replace",
          "pattern": "[^A-z0-9 ]+",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "foo": {
          "type": "text",
          "analyzer": "hsproduct"
        }
      }
    }
  }
}

POST /i/_analyze
{
  "field": "foo",
  "text": "school"
}

POST /i/doc/1
{"foo": "school"}

// Matches the document
POST /i/_search
{
  "query": {
    "multi_match": {
      "query": "scool",
      "fields": ["foo"],
      "fuzziness": 1
    }
  }
}
