Index-time boost in multi-valued fields


(Matthew A. Brown) #1

I came across some pretty crazy scoring behavior recently, where
certain matches on a field boosted at index-time had enormously high
field norms. After some illuminating discussion on the #lucene
channel, I tracked it down to this little nugget:

"The boost is multiplied by Document.getBoost() of the document
containing this field. If a document has multiple fields with the same
name, all such values are multiplied together. This product is then
used to compute the norm factor for the field."
(source: http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/core/org/apache/lucene/document/AbstractField.html#setBoost(float))

So basically the index-time boost you specify is taken to the power of
the number of values in the field!
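
To make the arithmetic concrete, here is a minimal sketch of the multiplication the quoted Javadoc describes. It is plain Python, not Lucene code: the function name `effective_field_boost` is made up for illustration, and applying the document boost once is an assumption based on the quoted wording. The real norm also folds in length normalization and a lossy encoding, which this ignores.

```python
from math import prod

def effective_field_boost(value_boosts, doc_boost=1.0):
    """Combined boost for one field name (hypothetical helper).

    Models the quoted Javadoc only: the document boost times the
    product of the boosts of every value sharing the field name.
    """
    return doc_boost * prod(value_boosts)

# The same index-time boost of 2.0 on every value of a multi-valued field
for n in (1, 2, 3, 5):
    print(n, effective_field_boost([2.0] * n))
# prints:
# 1 2.0
# 2 4.0
# 3 8.0
# 5 32.0
```

With a boost of 2.0 per value, a document with five values in the field ends up with a combined boost of 32 rather than 2, which matches the inflated field norms described above.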

Since the whole concept of a multi-valued field is more or less just
sugar in Lucene, might it make more sense for ES to concatenate the
values of a multi-valued field and pass them to Lucene as a single
value? The index-time boost would then be applied once, no matter how
many values the field has, and I don't really see a downside.

Just a thought!

Mat


(Shay Banon) #2

There are downsides to it: for example, when the field is stored
explicitly, when one uses nested mappings, when faceting on the field, or
when the values are not analyzed. In any case, there are no plans to
automatically concatenate the values of a multi-valued field into a single one.

In reply to Matthew A. Brown's message of Fri, Apr 6, 2012 at 10:37 PM.


(Matthew Schulkind) #3

Is there any way to work around this? I'd like to set a boost that
applies when any one of the values matches, without it depending on the
number of values in the field.

In reply to kimchy's message of Saturday, April 7, 2012, 11:27:31 AM UTC-4.

