Questions about Fuzzy Query

Hi guys,

I am brand new to ElasticSearch and am currently exploring its
features. One I am interested in is the fuzzy query, which I am
testing and having trouble using. It is probably a beginner's question,
so I guess someone who has already used this feature will quickly find
the answer, at least I hope. :)

I already posted this question on Stack Overflow but haven't received
an answer yet. If you want to take a look, this is the page:
http://stackoverflow.com/questions/10309199/elasticsearchs-fuzzy-query

BTW I have the feeling that it might not be only related to
ElasticSearch but maybe directly to Lucene.

Let's start with a new index named "firstindex" in which I store an
object with a "label" field set to "american football". This is the
request I use.

bash-3.2$ curl -XPOST 'http://localhost:9200/firstindex/node/?pretty=true' -d '{
  "node" : {
    "label" : "american football"
  }
}'

This is the result I get.

{
  "ok" : true,
  "_index" : "firstindex",
  "_type" : "node",
  "_id" : "6TXNrLSESYepXPpFWjpl1A",
  "_version" : 1
}

So far so good. Now I want to find this entry using a fuzzy query.
This is the one I send:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american football",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }
    }
  }
}'

And this is the result I get:

{
  "took" : 15,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 0,
    "max_score" : null,
    "hits" : [ ]
  }
}

As you can see, no hit. But when I shorten my query's value a bit,
from "american football" to "american footb", like this:

bash-3.2$ curl -XGET 'http://localhost:9200/firstindex/node/_search?pretty=true' -d '{
  "query" : {
    "fuzzy" : {
      "label" : {
        "value" : "american footb",
        "boost" : 1.0,
        "min_similarity" : 0.0,
        "prefix_length" : 0
      }
    }
  }
}'

Then I get a correct hit on my entry. The result is:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.19178301,
    "hits" : [ {
      "_index" : "firstindex",
      "_type" : "node",
      "_id" : "6TXNrLSESYepXPpFWjpl1A",
      "_score" : 0.19178301,
      "_source" : {
        "node" : {
          "label" : "american football"
        }
      }
    } ]
  }
}

So, I have several questions related to this test:

  1. Why didn't I get any result when querying with a value that is
    exactly equal to my only entry, "american football"?

  2. Is it related to the fact that I have a multi-word value?

  3. Is there a way to get the "similarity" score in my query result, so
    I can better understand how to find the right threshold for my fuzzy
    queries?

  4. There is a page dedicated to the fuzzy query on the ElasticSearch
    web site, but I am not sure it lists all the parameters I can use for
    the fuzzy query. Where could I find such an exhaustive list?

  5. Same question for the other queries, actually.

  6. Is there a difference between a Fuzzy Query and a Query String
    Query using Lucene syntax to get fuzzy matching?

Thanks in advance for your help!

Cheers,
Adrien

The problem is that the fuzzy query does not analyze the text provided (it's
similar to the term query in that regard). For that, we have the option to
provide "fuzziness" on the text query, which is analyzed. See a sample here:
gist:2507732 · GitHub. Note, though, that fuzzy queries are not fast in the
current Lucene version.
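
To make the explanation above more concrete, here is a rough sketch of why
the shorter value matched while the exact one did not. It rests on two
assumptions that are not confirmed in this thread: that the default analyzer
indexed the label as the two separate terms "american" and "football", and
that similarity follows the formula used by older Lucene fuzzy queries,
1 - distance / min(len(query), len(term)), with prefix_length 0. The function
names are made up for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def legacy_similarity(query: str, term: str) -> float:
    # ASSUMPTION: similarity formula of older Lucene fuzzy queries,
    # with prefix_length = 0. Not taken from this thread.
    return 1.0 - edit_distance(query, term) / min(len(query), len(term))


# ASSUMPTION: what the analyzer stored for "american football".
indexed_terms = ["american", "football"]

for value in ("american football", "american footb"):
    for term in indexed_terms:
        print(value, "vs", term, "->", round(legacy_similarity(value, term), 3))
```

Under these assumptions, "american footb" scores 0.25 against the term
"american", which is above a min_similarity of 0.0, while the full value
"american football" scores below zero against both terms because it is so
much longer than either of them.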

On Thu, Apr 26, 2012 at 4:16 AM, A_dit_rien adrien.ginesty@gmail.com wrote:

[...]

While trying to solve something similar, I played around with fuzzy
queries, fuzzy like this queries, fuzziness, and phonetic filters. What
ended up working best for me was to simply use my query string query and
append "~" to each term, which is a shortcut to Lucene's fuzzy query.
Essentially, I submit the query using the "raw" term(s) entered by the
user, and if I get back 0 results, I resubmit it as a query string query
with "~" appended to each of the user-entered term(s). The results are
incredibly relevant. For example, entering "linerd skinerd" returns results
for one of the greatest rock bands of all time, "Lynyrd Skynyrd"! This is
really powerful, but beware of the performance impact, which is why I only
use it when the initial search returns nothing.
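
The fallback described above could look roughly like this. It is only a
sketch: `search` stands for whatever function sends a query string query to
Elasticsearch and returns a list of hits, and it, `fuzzify`,
`search_with_fuzzy_fallback` and `fake_search` are all hypothetical names.
Terms are split on whitespace, and Lucene special characters are not escaped.

```python
from typing import Callable, List


def fuzzify(user_input: str) -> str:
    """Append "~" to every whitespace-separated term (no escaping is done)."""
    return " ".join(term + "~" for term in user_input.split())


def search_with_fuzzy_fallback(
    user_input: str, search: Callable[[str], List[dict]]
) -> List[dict]:
    """Run the raw query first; retry with fuzzy terms only if nothing came back."""
    hits = search(user_input)
    if hits:
        return hits
    return search(fuzzify(user_input))


# Hypothetical stand-in for a real query string search, for demonstration only:
# it "finds" the band only when every term carries the fuzzy operator.
def fake_search(query_string: str) -> List[dict]:
    terms = query_string.split()
    if terms and all(term.endswith("~") for term in terms):
        return [{"name": "Lynyrd Skynyrd"}]
    return []


print(fuzzify("linerd skinerd"))
print(search_with_fuzzy_fallback("linerd skinerd", fake_search))
```

Keeping the fuzzy pass as a second step means the slower query only runs
when the plain one returns nothing, which matches the performance caveat
above.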

On Wednesday, April 25, 2012 8:16:07 PM UTC-5, A_dit_rien wrote:

[...]

You can use the text query with the fuzziness parameter for that as well.
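
For illustration, a request body for that could be built along these lines.
This is a sketch based on my understanding of the text query, not a copy of
the gist mentioned earlier: the helper name is hypothetical, the field name
"label" comes from the example in this thread, and the fuzziness value 0.5
is an arbitrary assumption that would need tuning.

```python
import json


def text_query_body(field: str, text: str, fuzziness: float) -> dict:
    """Build a search body for an analyzed text query with fuzziness."""
    return {
        "query": {
            "text": {
                field: {
                    "query": text,           # analyzed, unlike the fuzzy query value
                    "fuzziness": fuzziness,  # ASSUMPTION: illustrative value only
                }
            }
        }
    }


body = text_query_body("label", "american football", 0.5)
print(json.dumps(body, indent=2))
```

The printed JSON could then be sent as the -d payload of a _search request
like the ones in the original post.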

On Fri, May 11, 2012 at 7:26 PM, my3sons carey.boldenow@gmail.com wrote:

[...]