Fuzzy matching and direct hit ranking

Woody_Peterson · May 15, 2012, 6:23pm

I'm trying to implement fuzzy matching, and wanted to get a sanity check,
as I'm hitting what was initially a surprising use case.

Let's say I have an index with documents: [{'name' => 'Coleman'}, {'name'
=> 'Boleman'}]. Doing a fuzzy_like_this search for 'coleman' will
non-deterministically return either document first, whereas before I read
up on the details of fuzzy search I would have expected the results to have
taken distance into account.

After reading some of the documentation and relevant posts on this forum, I
understand that what it's doing is expanding the search to all terms in the
index within a percentage-wise distance of the word. So in the above
example, my current understanding is that a search for 'coleman' is
literally the same thing as searching for 'coleman' and 'boleman'.
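
To make the "percentage-wise distance" concrete, here is a minimal Python sketch. It assumes a Lucene 3.x-style similarity of 1 - edits / min(len(query), len(term)) with no prefix length, and an illustrative minimum similarity of 0.8. The function names are hypothetical and this is not Elasticsearch code.

```python
def fuzzy_similarity(edits: int, query_len: int, term_len: int) -> float:
    """Assumed Lucene 3.x-style similarity: 1 - edits / shorter length."""
    return 1.0 - edits / min(query_len, term_len)


def is_expanded(edits: int, query_len: int, term_len: int,
                min_similarity: float = 0.8) -> bool:
    """True if a term this many edits away would be kept in the expansion."""
    return fuzzy_similarity(edits, query_len, term_len) >= min_similarity


# 'coleman' (7 letters) against 7-letter index terms:
print(is_expanded(0, 7, 7))  # exact term, similarity 1.0
print(is_expanded(1, 7, 7))  # one edit, e.g. 'boleman', similarity ~0.857
print(is_expanded(2, 7, 7))  # two edits, similarity ~0.714
```

Under these assumptions both 'coleman' and 'boleman' pass the threshold, which is consistent with the search being expanded to both terms.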

First of all, is this correct? Second, how do I achieve my desired
behavior?

My first thought is to do a dis_max query with both a text and
fuzzy_like_this. Would anyone pursue a different strategy instead?

Thanks!

-Woody

rpsandiford · May 15, 2012, 8:05pm

You might want to check that you are doing a lower case at indexing time.

Because - the distance between "Coleman" and "coleman" is the same as the distance between "Boleman" and "coleman" - i.e. both require a one-character replacement.
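
A small Python sketch to check this with a plain, case-sensitive Levenshtein distance. The helper is illustrative only and is not how Elasticsearch or Lucene computes it internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # replace (or match)
        prev = curr
    return prev[-1]


# Without lowercasing, both indexed terms are one edit from the query.
print(levenshtein("Coleman", "coleman"))  # 1
print(levenshtein("Boleman", "coleman"))  # 1

# With lowercasing at index time, the exact match is distance 0.
print(levenshtein("coleman", "coleman"))  # 0
print(levenshtein("boleman", "coleman"))  # 1
```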

Bob Sandiford | Lead Software Engineer | SirsiDynix

Bob_Sandiford · May 16, 2012, 2:58pm

You might want to check that you are doing a lower case at indexing
time.

Because – the distance between “Coleman” and “coleman” is the same as
the distance between “Boleman” and “coleman” – i.e. both require a one-
character replacement.

Bob.

Woody_Peterson · May 16, 2012, 4:46pm

the distance between “Coleman” and “coleman” is the same as the distance
between “Boleman” and “coleman”

I tried a ton of stuff yesterday, and that might explain the differences I was
seeing between fuzzy_like_this and 'text' searches with a fuzziness ('text'
searches go through normal tokenization, while fuzzy_like_this apparently
doesn't). I think if your field uses the standard analyzer, it should be
lowercased on index, so you would just have to lowercase it on search for
fuzzy_like_this or use 'text', no?

I eventually settled on the following, although it still exhibits this edge
case bug:

curl -XPUT 'http://localhost:9200/test/user/1' -d '{"name":"Coleman"}'
curl -XPUT 'http://localhost:9200/test/user/2' -d '{"name":"Boleman"}'

Now do the following, replacing 'Coleman' with 'Boleman', 'coleman', and
'boleman':

curl -X GET "http://localhost:9200/test/_search?pretty=true" -d \
'{"query":{"text":{"_all":{"query":"Coleman","fuzziness":0.8}}},"explain":true}'

I get a score of 0.30685282 for both documents in all search variations.
Contrary to what I stated in my first email, the order is not indeterminate
here. The varying order only shows up in some cases in my particular dataset,
because every once in a while a document has some tiny additional factor,
such as a slightly different fieldNorm, that boosts one doc over the other
and appears nonsensical to the user. This test case, however, is consistent,
although it isn't clear to me what secondary sorting keeps
'Coleman' always ahead of 'Boleman'; I'm assuming insertion order.

-Woody

Woody_Peterson · May 16, 2012, 4:52pm

curl -XPUT 'http://localhost:9200/test/user/1' -d '{"name":"Coleman"}'
curl -XPUT 'http://localhost:9200/test/user/2' -d '{"name":"Boleman"}'
... I get a score of 0.30685282 for both documents in all search
variations

I also meant to point out that I get exactly the same results (same score
and everything) when doing '{"name":"coleman"}' and '{"name":"boleman"}'.
I would have predicted that, but it is nice to verify.

Woody_Peterson · May 17, 2012, 6:02pm

Based on Bob's reaction, I would say at least one other person would expect
elasticsearch to rank direct hits above close fuzzy hits. If this is not
the case, is it a bug, or am I doing it wrong?

kimchy · May 20, 2012, 8:05pm

Yes, you will need to run both a text query and a fuzzy query, and rank the
text one higher...
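
One way this could look, sketched in Python as a request body builder. This is an assumption-laden illustration rather than a confirmed recipe: it combines an exact 'text' clause and a fuzzy 'text' clause under a bool/should, and it assumes the 'text' query accepts a "boost" parameter. The function name, the boost of 2.0, and the choice of bool over dis_max are all hypothetical.

```python
import json


def exact_over_fuzzy_query(field, text, fuzziness=0.8, exact_boost=2.0):
    """Build a search body whose exact clause outranks the fuzzy clause."""
    # Exact clause: matches only the analyzed query term, with a boost (assumed option).
    exact = {"text": {field: {"query": text, "boost": exact_boost}}}
    # Fuzzy clause: same text, expanded to nearby terms.
    fuzzy = {"text": {field: {"query": text, "fuzziness": fuzziness}}}
    # A document matching both clauses accumulates both scores.
    return {"query": {"bool": {"should": [exact, fuzzy]}}, "explain": True}


body = exact_over_fuzzy_query("name", "coleman")
print(json.dumps(body, indent=2))
```

The printed JSON can be passed as the -d payload of a _search request. Leaving "explain" on makes it easy to see which clause contributed to each hit's score.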

Woody_Peterson · May 21, 2012, 4:30pm

This thread (linked as "Carbon60: Managed Cloud Services") has the following
comment, which suggests that the scoring strategy is a Lucene option:

it should score "farming" higher than "farmin" by default, but the default
rewrite mode also takes TF/IDF into account (in addition). You can change
that by a different rewrite method:

The default is: MultiTermQuery.TopTermsScoringBooleanQueryRewrite (Lucene
3.5.0 API) (which combines the standard vector model with additionally
boosting exact matches - we have that for backwards compatibility only, it's
not what most users expect)

The better one is: MultiTermQuery.TopTermsBoostOnlyBooleanQueryRewrite
(Lucene 3.5.0 API), which does not take TF/IDF into account and only boosts
by Levenshtein distance.

You can disable fuzzy boosting altogether:
Additionally ScoringRewrite (Lucene 3.5.0 API) provides two other scoring
models (TF/IDF only, no boosting - or constant score at all)

Would it make sense to expose these options for use in elasticsearch?

kimchy · May 23, 2012, 9:57pm

Sadly, the Lucene query parser does not expose the ability to set the
rewrite method for fuzzy query, but we can work around that. You can set
the rewrite method for the query_string, but it applies to wildcard and
prefix queries. We can add a fuzzy rewrite method option as well here...

kimchy · May 23, 2012, 10:28pm

Had another look, and it seems like we can hook specific fuzzy settings into
the Lucene query parser and into other places. I opened an issue:
Query DSL: Add more fuzzy options in different queries (text, query_string/field) · Issue #1974 · elastic/elasticsearch · GitHub.
