Fuzziness & score computation


(Adrian Luna) #1

Hi,

Sorry, I am relatively new to Elasticsearch, so please don't be too
harsh.

I feel like I don't understand the behaviour of any of the
fuzzy queries in ES.

1) match with fuzziness enabled

{
  "query": {
    "fuzzy_like_this_field": {
      "field_name": {
        "like_text": "car renting London",
        "fuzziness": "0.5"
      }
    }
  }
}

As far as I can see from my tests, this kind of query gives the same score
to documents with field_name="car renting London" and to "car ranting
London" or "car renting Londen", for example. That means it does not
penalize misspellings at all. I imagine that the possible variants are
computed first and the score is then computed with a "representative
score", which is the same for every variant that matches the requirements.

Am I right? If I am, is there any way to boost the exact match over the
fuzzy match?

I also get results containing extra terms with the same score, like "cheap
car renting London" or "offers car renting London". That's something I
cannot understand. When I use the explain API, the resulting score seems to
be a sum of the different matches with their internal weightings (tf-idf,
etc.), but it does not seem to take into account the document terms that
are not in the query, while I would expect the exact match to score at
least slightly higher.

Am I missing something here? Is it just the expected result and I am just
being too demanding?

2) fuzzy query

That doesn't do what I want, since it does not analyze the query (I think),
so it will treat the query in an unexpected way for my purposes of
"free text" search.

3) fuzzy_like_this or fuzzy_like_this_field

This other search gets rid of the first problem in point 1, since, as I
read in the documentation, it seems to use some tricks to avoid favouring
rare terms (misspellings will fall here) over more frequent terms, etc. But
it still gives the same score to an exact match and to matches where other
terms are present.

Is there any way to get the expected behaviour? By this I mean being able
to execute almost free-text queries with some fuzziness to handle
possible misspellings in the query terms, but with an (at least for me)
more exhaustive score computation. If not, is there some other, more
complex query or a function_score that would give this behaviour?

Thank you very much, any comment will be much appreciated. Also, if
I am wrong in my assumptions, any clarification is very welcome.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/916f5408-ecfd-4676-8d48-db4467a9d839%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Zachary Tong) #2

You are correct in your analysis of the fuzzy scoring. Fuzzy variants are
scored (relatively) the same as the exact match, because they are treated
the same when executed internally.

If you want to score exact matches higher, I would use a boolean
combination of an exact match and a fuzzy match. Semi-pseudo-query here:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "my_field": {
              "query": "car renting london",
              "operator": "and",
              "boost": 2
            }
          }
        },
        {
          "fuzzy_like_this": {
            "fields": ["my_field"],
            "like_text": "car renting london"
          }
        }
      ]
    }
  }
}

Basically, the match query is set to the AND operator (so all terms are
required) and is given a boost of 2. That means exact matches will score
higher than the fuzzy matches, which have the default boost of 1.

Also I get results with more terms getting the same score, like "cheap car
renting London", "offers car renting London".

The reason you are seeing results like this is that you are using the
fuzzy_like_this query. It's a combination of more_like_this and fuzzy.
The way MLT works is that it takes all the individual terms in your query,
builds a big boolean query out of them and runs that against the index.
Docs just need to contain the terms, in no particular order. Fuzzy Like
This works the same way, except that terms are allowed to match fuzzily.
With MLT and FLT, you're bound to find "off-target" results because these
queries are sorta like shotguns, looking for a wide spread of terms.
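
To picture that "big boolean", here is a rough Python sketch that turns a
piece of free text into a bool query with one optional fuzzy clause per
term. It only illustrates the idea and is not how FLT is actually
implemented: the whitespace splitting, the function name and the field name
my_field are made up for the example.

```python
import json


def terms_to_fuzzy_bool(text, field, fuzziness=1):
    """Build a bool/should query with one fuzzy clause per term.

    Hypothetical helper: real analysis does more than lowercase and split.
    """
    terms = text.lower().split()
    clauses = [
        {"fuzzy": {field: {"value": term, "fuzziness": fuzziness}}}
        for term in terms
    ]
    # Every clause is optional ("should"), so term order and extra terms
    # in a document play no role in whether a clause matches.
    return {"query": {"bool": {"should": clauses}}}


print(json.dumps(terms_to_fuzzy_bool("car renting London", "my_field"), indent=2))
```

A document like "cheap car renting London" satisfies all three clauses just
as well as "car renting London" does, which is why both can come back.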

2) fuzzy query

That doesn't make what I want since it does not analyze the query (I
think) and so it will treat the query in an unexpected way for my purposes
of "free text" search

As an alternative, you can use the Match query and set the "fuzziness"
parameter. You'll get fuzzy like the fuzzy query, but analysis from the
Match query.
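
Something along these lines would combine that with the boosted match
clause from earlier in this thread. Treat it as an untested sketch:
my_field, the boost of 2 and the fuzziness value of 1 are placeholders, and
the values that fuzziness accepts depend on your ES version.

```python
import json


def exact_plus_fuzzy_match(text, field, fuzziness=1, exact_boost=2):
    """Return a search body that prefers all-terms matches over fuzzy ones."""
    exact = {
        "match": {
            # All terms required, no fuzziness, boosted.
            field: {"query": text, "operator": "and", "boost": exact_boost}
        }
    }
    fuzzy = {
        "match": {
            # The match query analyzes the text first, then applies
            # fuzziness to each resulting term.
            field: {"query": text, "fuzziness": fuzziness}
        }
    }
    return {"query": {"bool": {"should": [exact, fuzzy]}}}


body = exact_plus_fuzzy_match("car renting London", "my_field")
print(json.dumps(body, indent=2))
```

A document containing all the correctly spelled terms satisfies both
clauses, while a misspelled one only satisfies the fuzzy clause, so the
former should end up with the higher score.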

As a general comment, dealing with misspellings and fuzziness is always a
trade-off between precision (the fraction of returned results that are
correct) and recall (the fraction of correct results that are returned). As
you increase fuzziness, you increase recall: more of your correct results
are in your search hits. But you lose precision, and the correct results
may end up at position 200. You'll always be fighting the precision/recall
battle.
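
To put numbers on that, here is a tiny Python sketch. The document ids and
result lists are invented purely for illustration; only the two formulas
are standard.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result list against the relevant set."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: four documents are actually relevant to the query.
relevant = {"doc1", "doc2", "doc3", "doc4"}
# A strict search returns few hits, all of them correct.
strict_hits = ["doc1", "doc2"]
# A fuzzy search finds every relevant doc, plus four off-target ones.
fuzzy_hits = ["doc1", "doc2", "doc3", "doc4", "doc7", "doc8", "doc9", "doc10"]

print(precision_recall(strict_hits, relevant))  # (1.0, 0.5)
print(precision_recall(fuzzy_hits, relevant))   # (0.5, 1.0)
```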

I would instead search for exact matches and prompt the user to fix
misspellings with suggesters. This makes your search and relevancy vastly
simpler, and tends to provide a better user experience because they can
just click the as-you-type suggestion or the "Did you mean?" link. Win-win
for everyone.
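
As a rough idea of what that could look like, this Python sketch builds a
search body containing a term suggester. The suggestion name did_you_mean
and the field my_field are placeholders I picked for the example, so check
the suggester docs for your ES version before relying on the exact shape.

```python
import json


def term_suggest_body(text, field, name="did_you_mean"):
    """Build a search body asking the term suggester for corrections."""
    return {
        "suggest": {
            # "name" is an arbitrary label used to find the suggestions
            # in the response; it is not a reserved keyword.
            name: {
                "text": text,
                "term": {"field": field},
            }
        }
    }


print(json.dumps(term_suggest_body("car ranting Londen", "my_field"), indent=2))
```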

-Zach

(In reply to Adrian Luna's message of Thursday, March 20, 2014 4:46:49 AM UTC-5.)

