I have a question about scoring for fuzzy queries. If I understand
correctly, fuzzy queries find matching terms by calculating their
similarity to the query term using the Levenshtein distance, but this
similarity value is not used when calculating the document's score. Instead,
the document's score is based on the TF/IDF of the matched term. Is this
correct? Is it possible to score fuzzy queries by similarity to the queried
term instead?
E.g. I have the below custom_score query. I'd like the score returned to be
the similarity score that is evaluated against the min_similarity.
This is not entirely true. The FuzzyQuery uses the Levenshtein distance only
to find the terms in the index, which are subsequently used in a Boolean OR
query or in a constant-score filter, depending on the rewrite method you
choose. The default rewrite takes just the top 50 terms within a certain
Levenshtein distance (LD) and builds a query out of them. The scoring is then
simply that of your scoring model's Similarity, so TF/IDF (vector space
model) by default.
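To make the expansion step concrete, here is a minimal Python sketch of the idea, not Lucene's actual implementation. The function names, the in-memory term list and the max_distance parameter are assumptions made up for illustration; only the limit of 50 terms comes from the description above. The point it shows is that the edit distance selects terms and is then thrown away before scoring.

```python
import heapq


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # drop ch_a
                               current[j - 1] + 1,       # add ch_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def expand_fuzzy(query_term, index_terms, max_distance=2, max_expansions=50):
    """Return up to max_expansions index terms within max_distance edits,
    closest first. The distance only selects terms; it is not a score."""
    candidates = []
    for term in index_terms:
        distance = levenshtein(query_term, term)
        if distance <= max_distance:
            candidates.append((distance, term))
    return [term for _, term in heapq.nsmallest(max_expansions, candidates)]


# Hypothetical term dictionary for one field.
terms = ["lucene", "lucent", "lucerne", "license", "solr", "lucid"]
expanded = expand_fuzzy("lucene", terms)
print(expanded)
```

The returned terms would then be OR'ed together and scored by the normal scoring model, which is why the edit distance never shows up in the final score.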
I don't understand your last sentence, what do you mean by 'against the
min_similarity'?
simon
(In reply to the blab's post of Tuesday, March 12, 2013 6:45:09 PM UTC+1.)
Thanks for your response. By "against the min_similarity" I meant the
minimum value for the similarity of the fuzzy terms, i.e. the
min_similarity parameter provided in the query I posted.
To clarify, two "scores" are calculated for the query: the Levenshtein
distance, which determines which terms to use, and the actual score of the
returned results. I wanted the Levenshtein distance to be used to score the
returned results, but I don't think this is possible.
For future readers: I solved this issue by creating a custom score query and
writing a (native) script to handle the scoring.
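The native script itself was not posted in this thread. Purely as an illustration of the kind of computation such a script might perform (this is not the author's code and not an Elasticsearch API), here is a small Python sketch. The normalisation 1 - distance / length of the shorter string, the function names and the sample documents are all assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def similarity(query_term: str, term: str) -> float:
    """Assumed normalisation: 1 - distance / length of the shorter string,
    clamped at 0. Identical terms score 1.0."""
    shorter = min(len(query_term), len(term))
    if shorter == 0:
        return 1.0 if query_term == term else 0.0
    return max(0.0, 1.0 - levenshtein(query_term, term) / shorter)


def score_document(query_term, doc_terms, min_similarity=0.5):
    """Score a document by its best-matching term, ignoring TF/IDF entirely.
    Returns 0.0 when no term reaches min_similarity."""
    best = max((similarity(query_term, t) for t in doc_terms), default=0.0)
    return best if best >= min_similarity else 0.0


# Hypothetical documents, each reduced to its analysed terms for one field.
docs = {
    "doc1": ["apache", "lucene"],
    "doc2": ["lucent", "technologies"],
    "doc3": ["solr", "search"],
}
ranked = sorted(docs, key=lambda d: score_document("lucene", docs[d]), reverse=True)
for doc_id in ranked:
    print(doc_id, round(score_document("lucene", docs[doc_id]), 3))
```

In a real script the same calculation would run against the field's terms for each matching document, so the cost grows with the number of hits and the number of terms per field.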
Thanks
(In reply to simonw's post of Thursday, March 14, 2013 7:02:03 AM UTC.)
I also want to return a score based on Levenshtein distance from a fuzzy
query. Can you please elaborate on "writing a (native) script to handle the
scoring"? Did you actually write a script that calculates the distance, or
did you use some ES properties?
Thank you,
(In reply to the blab's post of Thursday, March 14, 2013 7:52:02 PM UTC+2.)