Fuzzy Search not working as intended

amlanrath · March 29, 2017, 1:38pm

I changed the fuzziness logic from the ES default, Fuzziness.AUTO, to my own, which depends on the length of the search term:

0-3 characters - no edits allowed (Fuzziness.ZERO)
4-10 characters - 1 edit allowed (Fuzziness.ONE)
10+ characters - 2 edits allowed (Fuzziness.TWO)
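
As a minimal sketch, the length buckets above could be written as a plain Java method. The class and method names are hypothetical and nothing here calls Elasticsearch. The ranges as written overlap at 10 characters, so the sketch assumes a 10-character term gets one edit.

```java
public class LengthBasedFuzziness {
    // Returns the maximum number of edits allowed for a term, by its length.
    // Assumption: a 10-character term falls in the one-edit bucket, because
    // the "4-10" and "10+" ranges overlap at exactly 10.
    static int maxEdits(String term) {
        int length = term.length();
        if (length <= 3) {
            return 0; // corresponds to Fuzziness.ZERO
        }
        if (length <= 10) {
            return 1; // corresponds to Fuzziness.ONE
        }
        return 2;     // corresponds to Fuzziness.TWO
    }

    public static void main(String[] args) {
        for (String term : new String[] {"BAG", "SCOOL", "FOOTWEAT", "SKATEBOARDS"}) {
            System.out.println(term + " -> " + maxEdits(term) + " edit(s)");
        }
    }
}
```

Under this mapping both "SCOOL" (5 characters) and "FOOTWEAT" (8 characters) are still allowed one edit.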

Previously, the fuzzy query was built like this -

MultiMatchQueryBuilder queryBuilder2 = QueryBuilders.multiMatchQuery(QueryParser.escape(suggestion), FULL_SEARCH_FIELDS).fuzziness(Fuzziness.AUTO).lenient(true).boost(0.1f);

Along with this change, I also added a separate boost for each required field, as follows -

QueryStringQueryBuilder queryBuilder2 = QueryBuilders.queryStringQuery(QueryParser.escape(suggestion))
        .analyzer("hsproduct")
        .fuzziness(fuzziness)
        .lenient(true);
for (ProductFuzzySearchKeywordEnum field : ProductFuzzySearchKeywordEnum.values()) {
    queryBuilder2.field(field.getFieldName(), field.getFieldBoost());
}

After the above changes, fuzzy search no longer matches some basic cases, such as "SCHOOL" for the search "SCOOL" and "FOOTWEAR" for the search "FOOTWEAT".

I am currently on ES 5.1 and would appreciate it if anyone can spot something wrong with what I have done.

dakrone · March 29, 2017, 4:10pm

What is your hsproduct analyzer doing? Can you paste the configuration for it?

venkata_sreekanth_bh · March 29, 2017, 5:06pm

Try indexing the field with an nGram analyzer. It generally yields better results than fuzzy. Set min_gram to the smallest length of your strings and max_gram to the largest length of your strings.

amlanrath · March 30, 2017, 12:34am

The hsproduct analyzer is as follows -

{
  "settings": {
    "analysis": {
      "analyzer": {
        "hsproduct": {
          "type": "custom",
          "filter": ["standard", "lowercase", "stopword", "synonym", "stemmer"],
          "tokenizer": "standard",
          "char_filter": ["input_char_filter"]
        }
      },
      "filter": {
        "stopword": {
          "type": "stop",
          "ignore_case": "true"
        },
        "stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "synonym": {
          "type": "synonym",
          "synonyms": ["hi,Hi"]
        }
      },
      "char_filter": {
        "input_char_filter": {
          "type": "pattern_replace",
          "pattern": "[^A-z0-9 ]+",
          "replacement": ""
        }
      }
    }
  }
}
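
A side note on the char filter pattern in this config, as a hedged sketch: in a Java regular expression the range A-z covers every code point from 'A' (65) to 'z' (122), which also includes the six characters [ \ ] ^ _ and the backtick. The plain-Java snippet below uses String.replaceAll rather than Elasticsearch itself, and the class name, method name and sample input are made up for illustration. It would not by itself explain the fuzzy misses for "SCOOL" or "FOOTWEAT", since neither contains such characters.

```java
public class CharFilterPattern {
    // Removes every run of characters matched by the given pattern,
    // mirroring a pattern_replace filter with an empty replacement.
    static String strip(String pattern, String input) {
        return input.replaceAll(pattern, "");
    }

    public static void main(String[] args) {
        String input = "boy's_shoes [new]!"; // hypothetical sample text

        // Pattern from the config: the A-z range keeps '_', '[' and ']'.
        System.out.println(strip("[^A-z0-9 ]+", input));

        // Explicit letter ranges: only letters, digits and spaces survive.
        System.out.println(strip("[^A-Za-z0-9 ]+", input));
    }
}
```

If the intent is to keep only letters, digits and spaces, the second form is probably what is wanted.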

amlanrath · March 30, 2017, 12:39am

But doesn't fuzzy search match any and all strings within a given Damerau-Levenshtein distance? For example, with an edit distance of 1 it should match SCOOL with SCHOOL, but not COOL with SCHOOL, since that requires an edit distance of 2. Or is there a limitation of fuzzy search in ES that I am not aware of? Please let me know if there is one.

Also, as far as I know, ngram tokenizers are more useful for languages that have very long compound words, according to the official Elasticsearch site -
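
To make the distances in this question concrete, here is a small plain-Java sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment), which counts insertions, deletions, substitutions and swaps of two adjacent characters. It is only an illustration with hypothetical names, not the way Elasticsearch or Lucene evaluates fuzzy queries internally.

```java
public class EditDistance {
    // Optimal string alignment distance between a and b: the minimum number
    // of insertions, deletions, substitutions and adjacent transpositions.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all of a's first i characters
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all of b's first j characters
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                        d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "bc" <-> "cb".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("SCOOL", "SCHOOL"));     // one insertion
        System.out.println(distance("COOL", "SCHOOL"));      // two insertions
        System.out.println(distance("FOOTWEAT", "FOOTWEAR")); // one substitution
    }
}
```

By this measure both "SCOOL" and "FOOTWEAT" are a single edit away from the indexed terms.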

"The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.

N-grams are like a sliding window that moves across the word - a continuous sequence of characters of the specified length. They are useful for querying languages that don’t use spaces or that have long compound words, like German.", taken from https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-ngram-tokenizer.html

Also, in my case min_gram would be 1-2 and max_gram would go up to 10-15. This would produce a lot of tokens for each indexed field. What are your views on this?
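
For a rough feel of the token volume, the sketch below enumerates the character n-grams of a single word in plain Java. It is an approximation with hypothetical names and does not use the Elasticsearch ngram tokenizer, so the emission order and token_chars handling there may differ.

```java
import java.util.ArrayList;
import java.util.List;

public class NGramCount {
    // Lists every substring of word whose length lies between minGram and
    // maxGram (inclusive), sliding the window one character at a time.
    static List<String> ngrams(String word, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < word.length(); start++) {
            for (int len = minGram;
                    len <= maxGram && start + len <= word.length();
                    len++) {
                grams.add(word.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        // min_gram=1, max_gram=15 on an 8-character word
        System.out.println(ngrams("footwear", 1, 15).size());
        // min_gram=2, max_gram=10 on the same word
        System.out.println(ngrams("footwear", 2, 10).size());
    }
}
```

A single 8-character word yields 36 grams with min_gram 1 and 28 with min_gram 2, so the index does grow quickly with these settings.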

venkata_sreekanth_bh · March 30, 2017, 12:55am

You are assuming that fuzzy would replace the characters you are expecting, but the offset could be anywhere.

amlanrath · March 30, 2017, 1:00am

OK, can you explain what you mean by this "offset" and how it affects fuzzy searches? I didn't know such a thing existed. Also, I have edited my earlier reply; can you have a look at that?

venkata_sreekanth_bh · March 30, 2017, 1:33am

Offset means the positions of the characters. Fuzzy allows for transposition, deletion, insertion and substitution. For example, 'abc' with a fuzziness of one will also return 'acb' as a valid result.

In your example, "SCHOOL" with the search "SCOOL" can match SDOOL, SCOOL, SCHOOD, etc. nGrams may not help here; regex queries will help more. https://www.elastic.co/guide/en/elasticsearch/reference/5.2/query-dsl-regexp-query.html

For "FOOTWEAR" with "FOOTWEAT", nGrams will match.

I also noticed you are setting a boost of 0.1f, which penalizes matches. If you really want to boost, use a value greater than 1; a value less than 1 lowers the score.

nGrams will produce a lot of tokens and take quite a bit of disk space too, but your queries will run faster than wildcard or fuzzy queries.

amlanrath · March 30, 2017, 5:27am

OK, I understand why ngrams would be faster, but I am sorry, I still don't understand why SCOOL will not match SCHOOL, as inserting an H should be allowed in fuzzy. Also, since my highest-relevance fields have a boost of 0.9f, I have adjusted the boosts of the other fields accordingly so that they don't get more relevance than those. And why is it punishing to use a boost of 0.1f if all my other fields use boosts in a similar range?

venkata_sreekanth_bh · March 30, 2017, 1:45pm

If you want to boost a certain result, use a boost greater than one; if you want to lower the relevancy, use a boost between 0 and 1. Please read the documentation regarding this.

dakrone · March 30, 2017, 6:56pm

I tried reproducing your issue with HTTP requests, but I could not reproduce it.

The following correctly matches the "school" document when searching for "scool":

DELETE i

PUT /i
{
  "settings":
  {
    "analysis": {
      "analyzer": {
        "hsproduct": {
          "type": "custom",
          "filter": ["standard", "lowercase", "stopword", "synonym", "stemmer"],
          "tokenizer": "standard",
          "char_filter": ["input_char_filter"]
        }
      },
      "filter": {
        "stopword": {
          "type": "stop",
          "ignore_case": "true"
        },
        "stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "synonym": {
          "type": "synonym",
          "synonyms":["hi,Hi"]
        }
      },
      "char_filter": {
        "input_char_filter": {
          "type": "pattern_replace",
          "pattern": "[^A-z0-9 ]+",
          "replacement": ""
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "foo": {
          "type": "text",
          "analyzer": "hsproduct"
        }
      }
    }
  }
}

POST /i/_analyze
{
  "field": "foo",
  "text": "school"
}

POST /i/doc/1
{"foo": "school"}

// Matches the document
POST /i/_search
{
  "query": {
    "multi_match": {
      "query": "scool",
      "fields": ["foo"],
      "fuzziness": 1
    }
  }
}

system · April 27, 2017, 6:56pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
