This is because the score takes two factors into account: the document
frequency and the edit distance. Quite likely in your case, even though
Boss is closer than Bose, Bose has a much lower document frequency which
helped it eventually get a better score. I guess we should have another
rewrite method that would not take freqs into account (or somehow merge
them) to avoid that issue.
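To make that interaction concrete, here is a toy sketch. It is not Lucene's actual scoring formula: the `toy_score` function, the assumed query term `boss`, and all frequencies below are made up for illustration. It only shows how an IDF-style weight can let a rarer term with a larger edit distance outrank a closer but more common one.

```python
import math

NUM_DOCS = 1_000_000  # hypothetical index size


def similarity(edit_distance, term_length):
    """Edit-distance similarity in [0, 1]; 1.0 means an exact match."""
    return 1.0 - edit_distance / term_length


def toy_score(edit_distance, term_length, doc_freq, num_docs=NUM_DOCS):
    """Toy score: edit-distance similarity times an IDF-style weight.

    This is an illustrative stand-in, not the formula Lucene uses.
    """
    idf = 1.0 + math.log(num_docs / (doc_freq + 1.0))
    return similarity(edit_distance, term_length) * idf


# Assumed query term: "boss" (4 characters). All numbers are hypothetical.
boss = toy_score(edit_distance=0, term_length=4, doc_freq=50_000)  # common, exact
bose = toy_score(edit_distance=1, term_length=4, doc_freq=20)      # rare, 1 edit away

# With the IDF-style weight, the rare term wins despite being further away.
print("bose outranks boss with IDF:", bose > boss)

# With edit distance alone, the closer term wins.
print("boss outranks bose without IDF:", similarity(0, 4) > similarity(1, 4))
```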
I have the same problem, where some results with higher edit distance are
ranked higher than other results that are closer in terms of edit distance.
I suspect it does have to do with document frequency, as you think, Adrien.
In my case I want to ignore document frequency completely. Any suggestion
to achieve this?
I'm a taker of any solution as this looks like a show stopper for us, so
even a workaround would help.
I can try to create this other rewrite method you mentioned if you could
point me in the right direction.
Thanks
On Thursday, January 15, 2015 at 7:44:57 AM UTC-8, Adrien Grand wrote:
This is because the score takes two factors into account: the document
frequency and the edit distance. Quite likely in your case, even though
Boss is closer than Bose, Bose has a much lower document frequency which
helped it eventually get a better score. I guess we should have another
rewrite method that would not take freqs into account (or somehow merge
them) to avoid that issue.
On Thu, Jan 15, 2015 at 4:06 PM, Eylon Steiner <eylon....@gmail.com> wrote:
For now try FuzzyLikeThis
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-flt-query.html#query-dsl-flt-query).
It blends More Like This and fuzzy functionality but includes the
adjustments to IDF that I think make more sense than the other
implementations with their bias towards rewarding scarcity.
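As a rough sketch of what such a request body might look like, assuming an Elasticsearch 1.x cluster where the `fuzzy_like_this` query type is available: the field name `name`, the search text, and the helper `build_flt_query` are hypothetical, and the parameter names should be checked against the reference page above for your version.

```python
import json


def build_flt_query(field, text, ignore_tf=True, max_query_terms=12):
    """Build a fuzzy_like_this request body as a plain dict.

    Assumes the Elasticsearch 1.x query DSL; the values chosen here are
    illustrative, not recommended settings.
    """
    return {
        "query": {
            "fuzzy_like_this": {
                "fields": [field],            # hypothetical field name
                "like_text": text,            # the text to fuzzify
                "ignore_tf": ignore_tf,       # ignore term frequency
                "max_query_terms": max_query_terms,
            }
        }
    }


# Print the JSON body that would be sent to the _search endpoint.
print(json.dumps(build_flt_query("name", "boss"), indent=2))
```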
Thanks Mark. Sounds like this issue affects a lot of people.
I looked at your suggestion about FLT, and the ignore_tf parameter should
help. However, unless I'm missing something, it doesn't seem like this would
address the IDF, so results could still be biased. But I will experiment.
Ultimately I think what my particular use case requires is a scorer that
only uses edit distance (when querying with fuzziness) and field boosts,
but no TF / IDF.
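One possible workaround, sketched here under loud assumptions, is to re-rank the returned hits on the client side using only edit distance and a per-field boost. Nothing below is an Elasticsearch or Lucene API: `levenshtein`, `rerank`, the `(field, matched_term)` hit shape, and the boost values are all hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def rerank(query, hits, field_boosts):
    """Sort hits by field boost times edit-distance similarity; no TF or IDF.

    `hits` is a list of (field, matched_term) pairs, a hypothetical shape.
    """
    def score(hit):
        field, term = hit
        longest = max(len(query), len(term)) or 1  # avoid division by zero
        sim = 1.0 - levenshtein(query, term) / longest
        return field_boosts.get(field, 1.0) * sim

    return sorted(hits, key=score, reverse=True)


hits = [("name", "bose"), ("name", "boss")]
print(rerank("boss", hits, {"name": 2.0}))
```

This only reorders whatever the original query returned, so a close match that was cut off by the result size would still be missing.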
On Monday, January 19, 2015 at 3:15:47 PM UTC-8, Mark Harwood wrote:
For now try FuzzyLikeThis
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-flt-query.html#query-dsl-flt-query)
It blends More Like This and fuzzy functionality but includes the
adjustments to IDF that I think make more sense than the other
implementations with their bias towards rewarding scarcity.
It's been nearly 10 years, but I took a quick look at the code and the IDF
balancing stuff is still in there.
Testing queries against a large index of cars, Lucene's standard fuzzy query
on ford~ still returns top matches that aren't Fords. FLT works fine.
On Tuesday, January 20, 2015 at 9:11:56 AM UTC, Itamar Syn-Hershko wrote:
Thanks Mark. Sounds like this issue affects a lot of people.
I looked at your suggestion about FLT, and the ignore_tf parameter
should help, however unless I'm missing something, it doesn't seem like
this would address the IDF, and results could be biased. But I will
experiment.
Ultimately I think what my particular use case requires is a scorer that
only uses edit distance (when querying with fuzziness) and field boosts,
but no TF / IDF.