Ignoring Density/Frequency

Hi All,

I'm trying to change search ranking and can't find info on how to do it. I
suspect I'm using the wrong terminology.

I am searching names within several million documents. Say I search for
"Tom OR Jones". Some of my docs contain many names, some only a few. The
result is that a doc containing Aled Jones as the only name scores more
highly than a doc that includes Tom Jones and a handful of other names as
well.

This is expected, as word density/frequency is taken into account in the
ranking, right?

Is there a way to configure the ranking so that the length of the doc is
not taken into account in the scoring, such that all docs containing Tom
Jones rank higher than Aled Jones regardless of the length of the doc?

What's the correct terminology to be searching for these kinds of settings?

thanks

rob

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

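You want omit_norms: true for the field(s) whose behaviour you want to
change, I guess; that turns off the field-length normalization you are
describing. Set "explain": true on your search request and dissect the
explanation in the response, and you will see the tf-idf calculations.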

http://www.elasticsearch.org/guide/reference/mapping/core-types/

Searching for "omit_norms" or "field length norm" will turn up much more
information.
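
For anyone finding this later, here is a minimal sketch of what that could look like, written as Python dicts that serialize to the JSON bodies. The type name "person_doc" and the field name "names" are made up for illustration. It assumes the string mapping options from the core-types page linked above; as far as I know, later Elasticsearch versions renamed omit_norms to a "norms" setting, so check the docs for your version.

```python
import json

# Hypothetical names for illustration only: type "person_doc", field "names".
# Mapping body that disables norms (field-length normalization) on one field.
mapping = {
    "person_doc": {
        "properties": {
            "names": {
                "type": "string",
                "omit_norms": True,  # option named in the reply above
            }
        }
    }
}

# Search body that asks for a per-hit scoring explanation.
search_body = {
    "explain": True,
    "query": {
        "query_string": {
            "default_field": "names",
            "query": "Tom OR Jones",
        }
    },
}

# Print the JSON you would send as the mapping and search request bodies.
print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```

Norms are written at index time, so a mapping change like this would most likely need the affected docs to be reindexed before the scores change.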

(In reply to Rob Styles rob@dynamicorange.com, Tue, May 7, 2013 at 7:05 AM.)

Looked at our implementation again--there are three things relevant to your
case.

  1. You may need DFS_QUERY_THEN_FETCH to get accurate tf-idf calculations
    across shards.
  2. Use omit_norms as mentioned earlier.
  3. Use omit_tf, so that term frequencies are ignored and repeated
    matches within a doc don't raise its score. We use both for several
    fields.

Those are the three "levers" to investigate.
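
To see why levers 2 and 3 matter, here is a toy sketch of classic tf-idf scoring in Python. It is a simplification, not Lucene's actual code: it uses tf = sqrt(freq) and a length norm of 1/sqrt(number of terms), leaves out the coord and query-norm factors, and uses made-up idf values. The function name, flags and sample docs are all hypothetical.

```python
import math

def score(query_terms, doc_terms, idf, use_norms=True, use_tf=True):
    """Simplified tf-idf score for an OR query against one field."""
    total = 0.0
    for term in query_terms:
        freq = doc_terms.count(term)
        if freq == 0:
            continue  # term not in this doc, contributes nothing
        tf = math.sqrt(freq) if use_tf else 1.0
        total += tf * idf[term] ** 2
    # Length norm: shorter fields get a higher multiplier.
    norm = 1.0 / math.sqrt(len(doc_terms)) if use_norms else 1.0
    return total * norm

query = ["tom", "jones"]
idf = {"tom": 1.0, "jones": 1.0}  # toy values, equal for simplicity

short_doc = ["aled", "jones"]  # one name only
long_doc = ["tom", "jones"] + ["name%d" % i for i in range(20)]  # many names

for use_norms in (True, False):
    print(
        "norms on" if use_norms else "norms off",
        "short:", round(score(query, short_doc, idf, use_norms), 3),
        "long:", round(score(query, long_doc, idf, use_norms), 3),
    )
```

With norms on, the short Aled Jones doc outscores the long Tom Jones doc; with norms off, the doc matching both terms wins, which is the ordering asked for in the original question.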

Hope this helps.

(In reply to Randall McRee randall.mcree@gmail.com, Tue, May 7, 2013 at 10:24 AM.)

Fantastic - thanks for the pointers :)

rob

(In reply to RKM, Tuesday, May 7, 2013 6:31:24 PM UTC+1.)