Subtle scoring issue with multi-value fields' fieldNorm being calculated as if they are one concatenated value

Hi all,

I'm facing a subtle scoring issue, which is no doubt Lucene related
but I'm wondering if there's a good ES solution to it.

A similar problem is summed up for 'that other Lucene search engine'
in this well-put StackOverflow question:
lucene - Scoring of solr multivalued field - Stack Overflow
-- but it is still relevant to ES.

Essentially, when scoring a multi-valued field -- the length for the
fieldNorm is calculated as if the multiple values of the field are
concatenated together, instead of having a unique fieldNorm for each
value.

I'm guessing that behind the scenes Lucene is indexing the multiple
values as one big value. This penalises multi-valued fields significantly.

A good example of why this is undesirable is in the above
StackOverflow question. Indexing people with multiple aliases:

Person 1: David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke
Person 2: David Letterman
Person 3: David Hasselhoff, David Michael Hasselhoff

Currently, when searching for "David", Person 2 comes first and Person 1
comes in last.

Intuitively, I'd expect the opposite: that searching for "David"
would bring up Person 1 or Person 3 first (for matching 2 values in
the field), then the other of the two, and that Person 2 would come in
dead last.

It seems Person 1 is penalised for having multiple field values: the
aggregated field value has the most tokens, so the field length
increases greatly and the fieldNorm suffers.
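
To make the effect concrete, here is a rough sketch of the arithmetic. It
assumes Lucene's classic default similarity, where the term-frequency
factor is sqrt(freq) and the length norm is 1/sqrt(number of tokens in the
field). It ignores idf and queryNorm (constant across documents for a
single-term query) and the lossy one-byte encoding of stored norms. The
function names are made up for illustration.

```python
import math

# The three example people, each with a multi-valued "aliases" field.
people = {
    "Person 1": ["David Bowie", "David Robert Jones",
                 "Ziggy Stardust", "Thin White Duke"],
    "Person 2": ["David Letterman"],
    "Person 3": ["David Hasselhoff", "David Michael Hasselhoff"],
}

def tokens(values):
    # All values of the field are treated as one token stream,
    # which is the behaviour described above.
    return [t.lower() for v in values for t in v.split()]

def score(values, term, use_norms=True):
    toks = tokens(values)
    tf = math.sqrt(toks.count(term.lower()))          # sqrt(freq)
    norm = 1.0 / math.sqrt(len(toks)) if use_norms else 1.0
    return tf * norm

def rank(term, use_norms=True):
    # Highest score first; ties keep the insertion order of `people`.
    return sorted(people,
                  key=lambda p: score(people[p], term, use_norms),
                  reverse=True)

print(rank("david"))                   # ['Person 2', 'Person 3', 'Person 1']
print(rank("david", use_norms=False))  # ['Person 1', 'Person 3', 'Person 2']
```

With norms, Person 1's ten tokens give a norm of about 0.32, which outweighs
the two matches on "David". Without norms, Person 1 and Person 3 tie ahead
of Person 2.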

Is there some ES mapping option to prevent this from happening? Or
alternatively, a query DSL directive to prevent it?

The alternative workaround is to index each permutation as a separate
document and somehow group them in the query -- what is a good way of
doing this with ES?
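
For what it's worth, one way I can picture the grouping is doing it on the
client after the search returns. This is only a sketch: it assumes each
alias was indexed as its own document carrying a hypothetical `person_id`
field, and that hits come back as dicts with a `score`. The sample hits
and their scores are invented purely to show the shape of the data.

```python
from collections import defaultdict

def group_by_person(hits):
    """Sum the scores of all alias documents belonging to one person.

    Summing rewards people who match on several aliases; taking the
    maximum instead would rank by the single best alias.
    """
    totals = defaultdict(float)
    for hit in hits:
        totals[hit["person_id"]] += hit["score"]
    # Highest combined score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Invented hits for a search on "David": one document per alias.
sample_hits = [
    {"person_id": 2, "alias": "David Letterman", "score": 0.7},
    {"person_id": 1, "alias": "David Bowie", "score": 0.7},
    {"person_id": 1, "alias": "David Robert Jones", "score": 0.5},
    {"person_id": 3, "alias": "David Hasselhoff", "score": 0.7},
    {"person_id": 3, "alias": "David Michael Hasselhoff", "score": 0.5},
]

grouped = group_by_person(sample_hits)
print(grouped)
```

The obvious downside is that paging and hit counts then have to be handled
on the client too, which is why a mapping-level or query-level fix would be
preferable.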

Many thanks,
Tal

On Mon, 2012-02-20 at 16:56 +1100, Tal Rotbart wrote:

Essentially, when scoring a multi-valued field -- the length for the
fieldNorm is calculated as if the multiple values of the field are
concatenated together, instead of having a unique fieldNorm for each
value.

Great question, and an interesting link provided by Simon:
http://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/api/contrib-misc/org/apache/lucene/misc/SweetSpotSimilarity.html

Looking forward to hearing more about this

clint

On Mon, 2012-02-20 at 07:24 +0100, Clinton Gormley wrote:

Setting the mapping for the field to {"omit_norms": true} seems to work:

gist:1868139 · GitHub

However, I'm not entirely sure what the impact of that mapping is, or
why Simon didn't mention it in his response to that question:
lucene - Scoring of solr multivalued field - Stack Overflow

Interested to hear more from those knowledgeable about Lucene.

clint

Yes, omitting norms would do the trick as well, and it makes more sense to set it in this case.
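
As a rough illustration of the mapping being discussed, the snippet below
builds the JSON body for a string field with norms omitted. The type name
`person` and field name `aliases` are hypothetical. The `omit_norms` key is
the one used in this thread; later Elasticsearch releases spell this option
differently, so check the mapping docs for your version.

```python
import json

TYPE_NAME = "person"    # hypothetical type name
FIELD_NAME = "aliases"  # hypothetical multi-valued field

# Mapping body with norms omitted for the multi-valued field, so field
# length no longer feeds into the score for that field.
mapping = {
    TYPE_NAME: {
        "properties": {
            FIELD_NAME: {
                "type": "string",
                "omit_norms": True,
            }
        }
    }
}

body = json.dumps(mapping, indent=2)
print(body)
```

Norms are written at index time, so documents indexed before the change
would most likely need reindexing to pick it up. Since norms also carry
index-time field boosts, omitting them gives up those boosts for this field.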

On Monday, February 20, 2012 at 8:37 AM, Clinton Gormley wrote:

Setting the mapping for the field to {"omit_norms": true} seems to work:

gist:1868139 · GitHub

I'll give it a shot and report back. Thanks guys!

On 20 February 2012 23:54, Shay Banon kimchy@gmail.com wrote:

Yes, omitting norms would do the trick as well, and it makes more sense to
set it in this case.
