Fuzzy query scoring based on Levenshtein distance

Hi,

I have a question about scoring for fuzzy queries. If I understand
correctly, a fuzzy query finds matching terms by calculating their
similarity to the query term using the Levenshtein distance, but this
similarity value is not used when calculating the document's score. Instead,
the document's score is based on the tf/idf of the matched term. Is this
correct? Is it possible to score fuzzy queries by similarity to the queried
term instead? For example, I have the custom_score query below. I'd like the
score returned to be the similarity value that is compared against
min_similarity.

{
  "query": {
    "custom_score": {
      "query": {
        "fuzzy": {
          "firstname": {
            "value": "Jack",
            "min_similarity": "0.5",
            "max_expansions": 1
          }
        }
      },
      "script": "_score"
    }
  }
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey,

This is not entirely true. The FuzzyQuery uses the Levenshtein distance (LD)
to find the terms in the index, which are subsequently used in a Boolean OR
query or in a ConstantScore filter, depending on the rewrite method you
choose. By default it just takes the top 50 terms within a certain LD and
builds a query out of them. The score is then simply whatever the similarity
of your scoring model produces, so TF/IDF (vector space) by default.
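
To illustrate the two steps described above, here is a minimal Python sketch. It is not Lucene's implementation: the term dictionary, the helper names (expand_fuzzy, idf) and the simplified IDF formula are assumptions made for illustration. It only shows that the edit distance selects the terms, while the weight of each selected term comes from index statistics.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def expand_fuzzy(query_term, index_terms, max_distance, max_expansions):
    """Step 1: keep the closest index terms within max_distance."""
    candidates = [(levenshtein(query_term, t), t) for t in index_terms]
    within = sorted(c for c in candidates if c[0] <= max_distance)
    return [t for _, t in within[:max_expansions]]


def idf(term, doc_freq, num_docs):
    """Step 2: a simplified IDF weight. Edit distance is not an input here."""
    return 1.0 + math.log(num_docs / (doc_freq[term] + 1))


# Hypothetical term dictionary for a "firstname" field: term -> document frequency
doc_freq = {"jack": 40, "jacky": 3, "zack": 1, "john": 25}

terms = expand_fuzzy("jack", doc_freq, max_distance=1, max_expansions=50)
weights = {t: idf(t, doc_freq, num_docs=100) for t in terms}
```

With these made-up frequencies, the rare fuzzy match "zack" gets a higher weight than the exact match "jack", which is the effect the original question is about.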

I don't understand your last sentence. What do you mean by 'against the
min_similarity'?

simon

On Tuesday, March 12, 2013 6:45:09 PM UTC+1, the blab wrote:

[...]

Thanks for your response. By "against the min_similarity" I meant the
minimum value for the similarity of the fuzzy terms, i.e. the
min_similarity parameter provided in the query I posted.

To clarify, there are two "scores" being calculated in the query: the
Levenshtein-based similarity, which determines which terms to use, and the
actual score of the returned results. I wanted the Levenshtein distance to be
used to score the returned results, but I don't think this is possible.

For future readers: I solved this issue by creating a custom score query and
writing a (native) script to handle the scoring.
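
The script itself was not posted in this thread. The Python sketch below only shows the kind of calculation such a script could perform; it is not the author's script and not an Elasticsearch API. The helper names (similarity, rescore), the lowercasing of field values, and the normalisation 1 - distance / min(length) are all assumptions.

```python
from functools import lru_cache


def levenshtein(a: str, b: str) -> int:
    """Edit distance via memoised recursion (fine for short terms such as names)."""
    @lru_cache(maxsize=None)
    def dist(i: int, j: int) -> int:
        if i == 0:
            return j
        if j == 0:
            return i
        cost = 0 if a[i - 1] == b[j - 1] else 1
        return min(dist(i - 1, j) + 1,         # deletion
                   dist(i, j - 1) + 1,         # insertion
                   dist(i - 1, j - 1) + cost)  # substitution
    return dist(len(a), len(b))


def similarity(query_term: str, term: str) -> float:
    """Assumed normalisation: 1 - distance / min(length), clamped at 0."""
    if query_term == term:
        return 1.0
    shortest = min(len(query_term), len(term))
    if shortest == 0:
        return 0.0
    return max(0.0, 1.0 - levenshtein(query_term, term) / shortest)


def rescore(query_term, hits, field, min_similarity):
    """Score each hit by similarity to the query term instead of tf/idf."""
    scored = []
    for hit in hits:
        # Assumes the field is analysed to lowercase, so compare lowercased values.
        s = similarity(query_term.lower(), hit[field].lower())
        if s >= min_similarity:
            scored.append((s, hit))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical hits, as if returned by the fuzzy query on "firstname"
hits = [{"firstname": "Zack"}, {"firstname": "Jack"},
        {"firstname": "Jacky"}, {"firstname": "John"}]
ranked = rescore("Jack", hits, "firstname", min_similarity=0.5)
```

Here the exact match "Jack" ranks first with a similarity of 1.0, "Zack" and "Jacky" follow at 0.75, and "John" falls below the 0.5 threshold and is dropped.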

Thanks

On Thursday, March 14, 2013 7:02:03 AM UTC, simonw wrote:

[...]

Hi Blab,

I also want to return a score based on the Levenshtein distance from a fuzzy
query. Could you please elaborate on "writing a (native) script to handle the
scoring"? Did you actually write a script that calculates the distance, or
did you use some ES properties?

Thank you,

On Thursday, March 14, 2013 7:52:02 PM UTC+2, the blab wrote:

[...]
--
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1012a30f-cdc9-4170-8b3f-c83866e2425d%40googlegroups.com.