Fuzzy search


#1

Hi

I am trying to build a search over movie titles using fuzzy search, but I don't fully understand how it works. I tried the "fuzzy" query, "match" and "query_string", but I am not getting the results I would like.

Basically, I would like to show the results with their scores (preferably as a percentage), and I would like to set a minimum percentage that a result must match. So far, I only get results if the strings are very similar.

For example, if there is a title called "Kill Bill: Vol. 1" and I search:

  • "Kill Bill: Vol. 2": it finds the title "Kill Bill: Vol. 1", so that is OK.
  • "K9ill Bill: Vol. 2": it doesn't find anything.

There are only 2 letters of difference. Could I configure the fuzziness to get more results, please?

I inserted the following data:

curl -XPUT "http://localhost:9200/movies/movie/5" -d'
{
    "title": "Kill Bill: Vol. 1",
    "director": "Quentin Tarantino",
    "year": 2003,
    "genres": ["Action", "Crime", "Thriller"]
}'

If I search for "Kill Bill: Vol. 2" I get results, and that is OK:

GET /_search
{
  "query":{
    "query_string" : {
        "default_field" : "content",
        "fields" : ["title"],
        "query" : "Kill Bill: Vol. 3",
        "minimum_should_match" : "10%"    
    }
  }
}

But I would like to get results even if someone makes a mistake and writes "Ki9ll Bill: Vol. 2":

GET /_search
{
  "query":{
    "query_string" : {
        "default_field" : "content",
        "fields" : ["title"],
        "query" : "K2ill Bill: Vol. 3",
        "minimum_should_match" : "10%"    
    }
  }
}

Could you guide me, please? Thanks.


(Mark Harwood) #2

The explain API helps show low-level details of why something doesn't match.

Using the URL below you can show specifically why doc 5 does or does not match:

GET /movies/movie/5/_explain
{
  "query":{
	"query_string" : {
		"fields" : ["title"],
		"query" : "K2ill Bill: Vol. 3",
		"minimum_should_match" : "10%"    
	}
  }
}

The partial results are below:

{
   "_index": "movies",
   "_type": "movie",
   "_id": "5",
   "matched": false,
   "explanation": {
	  "value": 0,
	  "description": "Failure to meet condition(s) of required/prohibited clause(s)",
	  "details": [
		 {
			"value": 0,
			"description": "no match on required clause (title:k2ill Bill:vol title:3)",
			"details": [
			   {
				  "value": 0,
				  "description": "No matching clauses",
				  "details": []
			   }
			]
		 }
		 ...
     	  
}

Note that the query is looking for the word "vol" in the nonexistent field "Bill" because of the colon in the query string.
You have two options:

  1. Switch to a query parser that doesn't support field names and the colon (e.g. the simple_query_string query)
  2. Escape your strings appropriately to avoid special characters.

...oh, and always check out the explain API :slightly_smiling:
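
As an illustration of option 2, here is a minimal Python sketch of escaping a user-supplied title before placing it in a query_string query. The helper name escape_query_string is hypothetical, and the character list is my reading of the reserved characters documented for query_string; check it against the documentation for your Elasticsearch version.

```python
import json

# Characters the query_string syntax treats as operators (assumed list).
# "<" and ">" cannot be escaped at all, so this sketch simply drops them.
RESERVED = set('+-=&|!(){}[]^"~*?:\\/')


def escape_query_string(text: str) -> str:
    """Backslash-escape reserved characters (hypothetical helper)."""
    out = []
    for ch in text:
        if ch in "<>":
            continue  # not escapable, so remove it
        if ch in RESERVED:
            out.append("\\" + ch)
        else:
            out.append(ch)
    return "".join(out)


escaped = escape_query_string("Kill Bill: Vol. 1")
print(escaped)  # Kill Bill\: Vol. 1

# Request body using the escaped text; json.dumps doubles the backslash,
# which is what a JSON request body needs.
query_body = {"query": {"query_string": {"fields": ["title"], "query": escaped}}}
print(json.dumps(query_body, indent=2))
```

With the colon escaped, the parser should no longer read "Bill" as a field name.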


#3

Thanks for your answer, Mark. I checked the documentation again and tried different parameters, but I don't understand why at least one term has to match 100%. For example, if I change the title to

PUT /movies/movie/6
{
    "title": "The Professional"
}

and I search

GET /movies/movie/6/_explain
{
  "query":{
    "query_string" : {
		"fields" : ["title"],
		"query" : "Te Profesional",
		"minimum_should_match" : "10%"
	}
  }
}

I don't get results, even though the Levenshtein distance is only 2 out of 16 characters.

Should I define an analyzer? Use a plugin? Could you guide me, please?


(Mark Harwood) #4

That percentage figure is the minimum number of search terms that must match, not the minimum number of characters.

So it's fuzziness in the sense of how many of the given words have to match, not fuzziness at the level of how many characters inside each word.
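
To make the distinction concrete, here is a small Python sketch (not Elasticsearch code) that contrasts the two notions for the "Te Profesional" example. It assumes the title and the query are simply lowercased and split on whitespace, which only approximates what an analyzer does.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution
        prev = cur
    return prev[-1]


query_terms = "te profesional".split()
title_terms = "the professional".split()

# Term-level view: how many query terms appear exactly in the title?
exact_hits = sum(1 for term in query_terms if term in title_terms)
print(exact_hits)  # 0

# Character-level view: edit distance of each query term to its counterpart.
for q, t in zip(query_terms, title_terms):
    print(q, t, levenshtein(q, t))
```

No query term matches an indexed term exactly, so a percentage of matching terms cannot help here, even though each term is only one edit away from the title's word.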


#5

Thanks Mark, I will stop trying that approach then :slightly_smiling:

I am looking for something different, because many titles have only one, two or three words. Do you know if there is any algorithm in Elasticsearch, or a plugin, for this?


(David Murgatroyd) #6

If you have the flexibility of considering a commercial Elasticsearch plug-in, Basis Technology (full disclosure: I'm VP, Engineering there) has a fuzzy name matcher plug-in: http://www.basistech.com/fuzzy-search-names-in-elasticsearch/. You can also get some idea of options with Elasticsearch out-of-the-box at that link. It's been optimized more for person, place and organization names than movies but should help.


#7

Hi David,

No, I don't have that flexibility; I intended to use it in an open source solution.

There is plenty of documentation and many examples about it, such as:


and there are many functions in multiple languages to do that, like similar_text in PHP http://php.net/manual/en/function.similar-text.php

I guess that if it doesn't already exist in Elasticsearch, it is about time.

Thanks anyway


(Mark Harwood) #8

There are various ways of doing "fuzzy" in the sense of character matching.

Some are index-time decisions you can make and some are query-time operations.

One index-time decision is to chop words into sub-strings using "n-grams" so that "word" is indexed as multiple tokens, e.g. [wo] [wor] [word] [or] [ord] [rd] etc.
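
As a rough illustration of the n-gram idea, the Python sketch below generates character n-grams of lengths 2 to 4 and counts how many trigrams a misspelled word shares with the correct one. The function char_ngrams and its defaults are made up for this example; in Elasticsearch the equivalent work would be done by an n-gram tokenizer or token filter configured in the index settings.

```python
def char_ngrams(word: str, min_n: int = 2, max_n: int = 4) -> list:
    """Return all substrings of length min_n..max_n, ordered by start position."""
    grams = []
    for start in range(len(word)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(word):
                grams.append(word[start:start + n])
    return grams


print(char_ngrams("word"))  # ['wo', 'wor', 'word', 'or', 'ord', 'rd']

# A misspelling still shares most of its trigrams with the correct word.
misspelled = set(char_ngrams("profesional", 3, 3))
correct = set(char_ngrams("professional", 3, 3))
shared = misspelled & correct
print(len(shared), "of", len(misspelled))  # 8 of 9
```

Because most tokens overlap, a misspelled query can still match the indexed title without any query-time fuzziness.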

Another index-time decision is to encode words as tokens that try to represent the sounds they make rather than their spelling. See "metaphone": according to [1], the word "metaphone" itself would be encoded as the token "MTFN".

A query-time (and therefore more CPU-intensive) operation is to do string comparisons on the fly, comparing search terms with the words in the index. Various queries support this form of fuzzy matching, e.g. the match query, the fuzzy query and query_string (using the tilde character).
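
For the query-time route, here is a hedged Python sketch that builds two request bodies: a match query with the "fuzziness" parameter, and a query_string query with a tilde appended to every term. The helper names are hypothetical, the field name "title" comes from the example earlier in the thread, and the bodies are only printed, not sent to a cluster.

```python
import json


def fuzzy_match_body(field: str, text: str, fuzziness="AUTO") -> dict:
    """Body for a match query that allows character-level edits per term."""
    return {"query": {"match": {field: {"query": text, "fuzziness": fuzziness}}}}


def fuzzy_query_string_body(field: str, text: str) -> dict:
    """Body for a query_string query with "~" appended to each term.

    Assumes the text has no reserved characters; escape them first if it does.
    """
    fuzzy_terms = " ".join(term + "~" for term in text.split())
    return {"query": {"query_string": {"fields": [field], "query": fuzzy_terms}}}


match_body = fuzzy_match_body("title", "Te Profesional")
qs_body = fuzzy_query_string_body("title", "Te Profesional")

print(json.dumps(match_body, indent=2))
print(json.dumps(qs_body, indent=2))
```

As far as I understand it, "AUTO" scales the allowed edit distance with term length, so a very short term such as "Te" may still have to match exactly while "Profesional" is allowed a couple of edits.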

I recommend reading the elastic guide on all this: https://www.elastic.co/guide/en/elasticsearch/guide/master/fuzzy-matching.html

[1] http://metaphone.onlinephpfunctions.com/

