When searching for 'Boss' with fuzziness, I get a higher score for 'Bose' than for 'Boss'. How come?

Any ideas?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/52e09e54-90b6-4014-8454-34e3db5756e5%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

This is because the score takes two factors into account: the document
frequency and the edit distance. Quite likely in your case, even though
Boss is closer than Bose, Bose has a much lower document frequency which
helped it eventually get a better score. I guess we should have another
rewrite method that would not take freqs into account (or somehow merge
them) to avoid that issue.
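
To make the interaction concrete, here is a rough Python sketch of how a rare term can outscore an exact match. It is a toy model, not Lucene's actual scoring code: the index size, the document frequencies, and the function names are all made up for illustration, and norms and term frequency are ignored.

```python
import math


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_boost(query: str, term: str) -> float:
    """Similarity in [0, 1]: 1.0 for an exact match, lower for each edit."""
    return 1.0 - levenshtein(query, term) / min(len(query), len(term))


def idf(doc_freq: int, num_docs: int) -> float:
    """TF-IDF style inverse document frequency: rare terms get large values."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def toy_score(query: str, term: str, doc_freq: int, num_docs: int) -> float:
    # Assumption: score is the edit-distance boost times squared IDF.
    return fuzzy_boost(query, term) * idf(doc_freq, num_docs) ** 2


NUM_DOCS = 1_000_000                      # hypothetical index size
DOC_FREQ = {"boss": 50_000, "bose": 50}   # hypothetical document frequencies

scores = {t: toy_score("boss", t, df, NUM_DOCS) for t, df in DOC_FREQ.items()}
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {score:.2f}")
```

With these invented numbers the exact match "boss" keeps the full boost of 1.0, yet "bose" ends up ahead because its IDF is so much larger.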

On Thu, Jan 15, 2015 at 4:06 PM, Eylon Steiner <eylon.steiner@gmail.com> wrote:

> Any ideas?

--
Adrien Grand

I have the same problem, where some results with higher edit distance are
ranked higher than other results that are closer in terms of edit distance.

I suspect it does have to do with document frequency, as you think Adrien.
In my case I want to ignore document frequency completely. Any suggestion
to achieve this?

I'm a taker of any solution as this looks like a show stopper for us, so
even a workaround would help.

I can try to create this other rewrite method you mentioned if you could
point me in the right direction.

Thanks

On Thursday, January 15, 2015 at 7:44:57 AM UTC-8, Adrien Grand wrote:

> This is because the score takes two factors into account: the document
> frequency and the edit distance. Quite likely in your case, even though
> Boss is closer than Bose, Bose has a much lower document frequency which
> helped it eventually get a better score. I guess we should have another
> rewrite method that would not take freqs into account (or somehow merge
> them) to avoid that issue.

This issue rounds up a bunch of related issues that have been raised
previously: "Wrap stacked tokens in `match` query in a BlendedTerms query for better scoring" (Issue #9103, elastic/elasticsearch on GitHub).

For now try FuzzyLikeThis
(http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-flt-query.html#query-dsl-flt-query).
It blends More Like This and fuzzy functionality but includes the
adjustments to IDF that I think make more sense than the other
implementations with their bias towards rewarding scarcity.
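
As a rough illustration of the IDF adjustment, the Python sketch below compares per-variant IDF with a single IDF shared by all fuzzy variants. It assumes the adjustment amounts to scoring every variant with the source term's IDF; the boosts, document frequencies, and function names are hypothetical and this is not the FuzzyLikeThis code itself.

```python
import math


def idf(doc_freq: int, num_docs: int) -> float:
    """TF-IDF style inverse document frequency."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 1_000_000  # hypothetical index size

# variant -> (edit-distance boost relative to the query "boss", document frequency)
# All values are invented for illustration.
VARIANTS = {"boss": (1.0, 50_000), "bose": (0.75, 50)}


def per_term_idf_scores(variants, num_docs):
    """Each variant is scored with its own IDF, so scarcity is rewarded."""
    return {t: boost * idf(df, num_docs) ** 2 for t, (boost, df) in variants.items()}


def shared_idf_scores(source_term, variants, num_docs):
    """Every variant reuses the source term's IDF, so only the boost differs."""
    shared = idf(variants[source_term][1], num_docs)
    return {t: boost * shared ** 2 for t, (boost, _df) in variants.items()}


per_term = per_term_idf_scores(VARIANTS, NUM_DOCS)
shared = shared_idf_scores("boss", VARIANTS, NUM_DOCS)
print("per-term IDF:", per_term)
print("shared IDF:  ", shared)
```

Under the shared-IDF variant the ordering depends only on the edit-distance boost, so "boss" ranks above "bose" regardless of how rare "bose" is.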

On Monday, January 19, 2015 at 6:48:49 PM UTC, kasper...@yahoo.com wrote:

> I have the same problem, where some results with higher edit distance are
> ranked higher than other results that are closer in terms of edit distance.
>
> I suspect it does have to do with document frequency, as you think Adrien.
> In my case I want to ignore document frequency completely. Any suggestion
> to achieve this?
>
> I'm a taker of any solution as this looks like a show stopper for us, so
> even a workaround would help.
>
> I can try to create this other rewrite method you mentioned if you could
> point me in the right direction.
>
> Thanks

Thanks Mark. Sounds like this issue affects a lot of people.

I looked at your suggestion about FLT, and the ignore_tf parameter should
help, however unless I'm missing something, it doesn't seem like this would
address the IDF, and results could be biased. But I will experiment.

Ultimately I think what my particular use case requires is a scorer that
only uses edit distance (when querying with fuzziness) and field boosts,
but no TF / IDF.
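
A scorer of that kind can be sketched in a few lines of Python. This is only an illustration of the idea, not an Elasticsearch or Lucene API: the field names, boost values, and candidate terms are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


# Hypothetical per-field boosts.
FIELD_BOOSTS = {"title": 2.0, "description": 1.0}


def edit_distance_score(query: str, term: str, field: str) -> float:
    """Score from edit distance and field boost only; no TF, no IDF."""
    similarity = 1.0 - levenshtein(query, term) / min(len(query), len(term))
    return max(similarity, 0.0) * FIELD_BOOSTS[field]


# (matched term, field) pairs, invented for illustration.
candidates = [
    ("bose", "title"),
    ("boss", "title"),
    ("boss", "description"),
    ("bass", "description"),
]

ranked = sorted(candidates, key=lambda c: edit_distance_score("boss", *c), reverse=True)
for term, field in ranked:
    print(term, field, edit_distance_score("boss", term, field))
```

Within a single field the exact match always comes first here. Across fields the boost can still lift a one-edit match in a heavily boosted field above an exact match in a weaker one, which may or may not be what you want.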

On Monday, January 19, 2015 at 3:15:47 PM UTC-8, Mark Harwood wrote:

> This issue rounds up a bunch of related issues that have been raised
> previously: "Wrap stacked tokens in `match` query in a BlendedTerms query for better scoring" (Issue #9103, elastic/elasticsearch on GitHub).
>
> For now try FuzzyLikeThis
> (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-flt-query.html#query-dsl-flt-query).
> It blends More Like This and fuzzy functionality but includes the
> adjustments to IDF that I think make more sense than the other
> implementations with their bias towards rewarding scarcity.

> it doesn't seem like this would address the IDF

Trust me, I wrote it.

On Tuesday, January 20, 2015 at 12:16:44 AM UTC, kasper...@yahoo.com wrote:

> Thanks Mark. Sounds like this issue affects a lot of people.
>
> I looked at your suggestion about FLT, and the ignore_tf parameter should
> help, however unless I'm missing something, it doesn't seem like this would
> address the IDF, and results could be biased. But I will experiment.
>
> Ultimately I think what my particular use case requires is a scorer that
> only uses edit distance (when querying with fuzziness) and field boosts,
> but no TF / IDF.

Famous last words :)

--

Itamar Syn-Hershko
http://code972.com | @synhershko https://twitter.com/synhershko
Freelance Developer & Consultant
Lucene.NET committer and PMC member

On Tue, Jan 20, 2015 at 11:11 AM, Mark Harwood <mark.harwood@elasticsearch.com> wrote:

>> it doesn't seem like this would address the IDF
>
> Trust me, I wrote it.

:)
It's been nearly 10 years, but I took a quick look at the code and the IDF
balancing stuff is still in there.
When I test queries against a large index of cars, Lucene's standard fuzzy
query on ford~ still returns top matches that aren't Fords. FLT works fine.

On Tuesday, January 20, 2015 at 9:11:56 AM UTC, Itamar Syn-Hershko wrote:

> Famous last words :)
