Fuzzy search

diaz · March 10, 2016, 10:51am

Hi

I am trying to search movie titles using fuzzy search, but I don't fully understand how it works. I tried the "fuzzy" query, "match" and "query_string", but I am not getting the results I would like.

Basically, I would like to show the results with their scores (preferably as a percentage), and I would like to set a minimum percentage that a result should match. So far, I am getting results only if the strings are very similar.

For example, if there is a title called "Kill Bill: Vol. 1" and I search:

"Kill Bill: Vol. 2" It finds the title "Kill Bill: Vol. 1", so it is OK
"K9ill Bill: Vol. 2" It doesn't find anything.

There are only 2 letters of difference. Could I configure the fuzzy search to return more results, please?

I inserted the following data:

curl -XPUT "http://localhost:9200/movies/movie/5" -d'
{
    "title": "Kill Bill: Vol. 1",
    "director": "Quentin Tarantino",
    "year": 2003,
    "genres": ["Action", "Crime", "Thriller"]
}'

If I search "Kill Bill: Vol. 2" I get results, and that is ok.

GET /_search
{
  "query":{
    "query_string" : {
        "default_field" : "content",
        "fields" : ["title"],
        "query" : "Kill Bill: Vol. 3",
        "minimum_should_match" : "10%"    
    }
  }
}

But I would like to get results even if someone makes a mistake and writes "Ki9ll Bill: Vol. 2":

GET /_search
{
  "query":{
    "query_string" : {
        "default_field" : "content",
        "fields" : ["title"],
        "query" : "K2ill Bill: Vol. 3",
        "minimum_should_match" : "10%"    
    }
  }
}

Could you guide me, please? Thanks.

Mark_Harwood · March 10, 2016, 12:36pm

The explain API helps show low-level details of why something doesn't match.

Using the URL below you can show specifically why doc 5 does or does not match:

GET /movies/movie/5/_explain
{
  "query":{
	"query_string" : {
		"fields" : ["title"],
		"query" : "K2ill Bill: Vol. 3",
		"minimum_should_match" : "10%"    
	}
  }
}

The partial results are below:

{
   "_index": "movies",
   "_type": "movie",
   "_id": "5",
   "matched": false,
   "explanation": {
	  "value": 0,
	  "description": "Failure to meet condition(s) of required/prohibited clause(s)",
	  "details": [
		 {
			"value": 0,
			"description": "no match on required clause (title:k2ill Bill:vol title:3)",
			"details": [
			   {
				  "value": 0,
				  "description": "No matching clauses",
				  "details": []
			   }
			]
		 }
		 ...
     	  
}

Note that the query is looking for the word "vol" in the nonexistent field "Bill" because of the colon in the query string.
You have two options:

Switch to a query parser that doesn't support field names and the colon (e.g. the simple_query_string query).
Escape your strings appropriately so that special characters are not treated as query syntax.

...oh, and always check out the explain API.
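The second option above (escaping) could be sketched roughly as follows in Python. This is a minimal sketch, not an official helper: the function names `escape_query_string` and `build_query` are hypothetical, and the character list reflects the reserved characters documented for query_string as I understand them. Note that `<` and `>` cannot be escaped, so the sketch simply drops them.

```python
# Characters that query_string treats as syntax and that can be
# backslash-escaped. "&&" and "||" are covered by escaping each "&" and "|".
ESCAPABLE = set('+-=&|!(){}[]^"~*?:\\/')


def escape_query_string(text):
    """Return text with query_string special characters neutralised (hypothetical helper)."""
    out = []
    for ch in text:
        if ch in "<>":
            # Assumption: these cannot be escaped, so remove them entirely.
            continue
        if ch in ESCAPABLE:
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)


def build_query(user_input, field="title"):
    """Build a query_string request body (as a dict) from raw user input."""
    return {
        "query": {
            "query_string": {
                "fields": [field],
                "query": escape_query_string(user_input),
            }
        }
    }


body = build_query("K2ill Bill: Vol. 3")
print(body["query"]["query_string"]["query"])
```

With the colon escaped, "Bill" is no longer read as a field name. When the dict is serialised to JSON, the backslash itself is escaped again by the JSON encoder, which is expected.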

diaz · March 10, 2016, 4:02pm

Thanks Mark for your answer. I checked the documentation again and tried different parameters, but I don't understand why at least one term has to match 100%. For example, if I change the title to

PUT /movies/movie/6
{
    "title": "The Professional"
}

and I search

GET /movies/movie/6/_explain
{
  "query":{
    "query_string" : {
		"fields" : ["title"],
		"query" : "Te Profesional",
		"minimum_should_match" : "10%"
	}
  }
}

I don't get results, even though the Levenshtein distance is only 2 out of 16 characters.

Should I define an analyzer? Use a plugin? Could you guide me, please?

Mark_Harwood · March 10, 2016, 4:03pm

That percentage figure is the minimum number of search terms that must match, not the minimum number of characters.

So it's fuzziness in the sense of how many of the given words have to match, not fuzziness at the level of how many characters inside each word.
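To make the arithmetic concrete, here is a small Python sketch of how a positive percentage turns into a number of required terms. The function names are hypothetical, and two things are assumptions based on my reading of the documentation: the computed number is rounded down, and a query made only of optional terms still needs at least one of them to match.

```python
def min_should_match(num_terms, percent):
    """Number of terms required by a positive percentage, rounded down."""
    return num_terms * percent // 100


def terms_needed(num_terms, percent):
    """Effective requirement, assuming at least one optional term must
    always match when there are no required clauses."""
    return max(1, min_should_match(num_terms, percent))


# "Te Profesional" is two search terms after analysis.
print(min_should_match(2, 10))  # 10% of 2 terms rounds down to 0
print(terms_needed(2, 10))      # but one whole term still has to match
```

So with "10%" on a two-word query, neither misspelled word matches an indexed term exactly, and the document is not returned.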

diaz · March 10, 2016, 4:29pm

Thanks Mark, I will not keep trying then

I am looking for something different then, because many titles have only one, two or three words. Do you know if there is any algorithm in Elasticsearch, or a plugin, for this?

dmurga · March 11, 2016, 2:15am

If you have the flexibility of considering a commercial Elasticsearch plug-in, Basis Technology (full disclosure: I'm VP, Engineering there) has a fuzzy name matcher plug-in: http://www.basistech.com/fuzzy-search-names-in-elasticsearch/. You can also get some idea of options with Elasticsearch out-of-the-box at that link. It's been optimized more for person, place and organization names than movies but should help.

diaz · March 11, 2016, 8:44am

Hi David,

No, I don't have that flexibility; I intended to use it in an open source solution.

There is a lot of documentation and examples about it, such as:

and there are many functions in multiple languages to do that, like similar_text in PHP: http://php.net/manual/en/function.similar-text.php

I guess that if it doesn't already exist in Elasticsearch, it is about time.

Thanks anyway

Mark_Harwood · March 11, 2016, 10:56am

There are various ways of doing "fuzzy" in the sense of character matching.

Some are index-time decisions you can make and some are query-time operations.

One index-time decision is to chop words into sub-strings using "n-grams" so that "word" is indexed as multiple tokens, e.g. [wo] [wor] [word] [or] [ord] [ord ] [rd] [rd ] etc.
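A rough Python sketch of the n-gram idea, outside Elasticsearch, may help show why it tolerates typos. The function `char_ngrams` and its default sizes are hypothetical illustrations and not the tokenizer's actual implementation; it also ignores trailing-space tokens.

```python
def char_ngrams(word, min_n=2, max_n=4):
    """Return all substrings of word with length between min_n and max_n."""
    grams = []
    for start in range(len(word)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(word):
                grams.append(word[start:start + n])
    return grams


print(char_ngrams("word"))

# A misspelled term still shares several n-grams with the indexed term.
indexed = set(char_ngrams("kill", 2, 3))
searched = set(char_ngrams("k9ill", 2, 3))
print(sorted(indexed & searched))
```

Because "kill" and "k9ill" share some n-grams, a search on an n-gram field can still find the document, with more shared tokens generally meaning a higher score.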

Another index-time decision is to encode words as tokens that try and represent the actual sounds they make rather than their spelling. See "metaphone" which according to [1] would be the token "MTFN".

A query-time operation (and therefore more CPU-intensive) is to do string comparisons on the fly, comparing search terms with the words in the index. There are various queries, e.g. the match query, the fuzzy query and query_string (using the tilde character), that support this form of fuzzy matching.
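As an illustration of the query-time approach, the Python sketch below computes a plain Levenshtein distance per term and builds a match query body with "fuzziness": "AUTO". The helper names are hypothetical. Two caveats: by default Elasticsearch also counts a transposition of adjacent characters as a single edit, which this plain version does not, and the AUTO thresholds in `auto_max_edits` are the documented defaults as I understand them.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def auto_max_edits(term):
    """Edits allowed per term under AUTO (assumed defaults: 0, 1 or 2)."""
    n = len(term)
    if n <= 2:
        return 0
    return 1 if n <= 5 else 2


def fuzzy_match_query(text, field="title"):
    """Request body (as a dict) for a match query with per-term fuzziness."""
    return {"query": {"match": {field: {"query": text, "fuzziness": "AUTO"}}}}


for typed, indexed in [("te", "the"), ("profesional", "professional"), ("k2ill", "kill")]:
    print(typed, indexed, levenshtein(typed, indexed), auto_max_edits(typed))
```

The fuzziness is applied to each term separately, with a maximum of 2 edits per term, so it will not match arbitrarily different strings. Under these thresholds a very short term such as "te" would have to match exactly, while "profesional" and "k2ill" are within range of the indexed terms.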

I recommend reading the elastic guide on all this: https://www.elastic.co/guide/en/elasticsearch/guide/master/fuzzy-matching.html

[1] http://metaphone.onlinephpfunctions.com/
