Hi all,
I'm facing a subtle scoring issue, which is no doubt Lucene-related,
but I'm wondering if there's a good ES solution to it.
A similar problem is summed up for 'that other Lucene search engine'
in this well put StackOverflow question:
-- but is still relevant to ES.
Essentially, when scoring a multi-valued field, the length used for
the fieldNorm is calculated as if the multiple values of the field
were concatenated together, instead of each value having its own
fieldNorm.
I'm guessing that behind the scenes Lucene is indexing the multiple
values as one big value. Since the fieldNorm shrinks as the token
count grows, this penalises multi-valued fields significantly.
A good example of why this is undesirable is in the above
StackOverflow question: indexing people with multiple aliases.
Person 1: David Bowie, David Robert Jones, Ziggy Stardust, Thin White Duke
Person 2: David Letterman
Person 3: David Hasselhoff, David Michael Hasselhoff
Currently, when searching for "David", Person 2 comes first and
Person 1 comes last.
Intuitively, I'd expect the opposite: searching for "David" should
bring up Person 1 or Person 3 first (each matches in 2 values of the
field), then the other of the two, and Person 2 should come in dead
last.
It seems Person 1 is penalised for having multiple field values: the
aggregated field value has the most tokens, so the field length is
greatly increased and the fieldNorm suffers.
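To make the arithmetic concrete, here is a rough sketch in plain
Python (no Lucene or ES involved). It assumes the classic TF-IDF
similarity, where tf = sqrt(term frequency) and the length norm is
1/sqrt(number of tokens in the field), and it ignores idf (identical
for all three documents here), any boosts, and the lossy encoding of
stored norms. The names approx_score and rank are made up for this
illustration.

```python
import math

# The three people from the example, each with a multi-valued alias field.
people = {
    "Person 1": ["David Bowie", "David Robert Jones",
                 "Ziggy Stardust", "Thin White Duke"],
    "Person 2": ["David Letterman"],
    "Person 3": ["David Hasselhoff", "David Michael Hasselhoff"],
}


def approx_score(values, term):
    """Approximate tf * lengthNorm, treating all values as one long field.

    Assumption: classic similarity, tf = sqrt(freq) and
    lengthNorm = 1 / sqrt(total token count across all values).
    """
    tokens = [tok.lower() for value in values for tok in value.split()]
    freq = tokens.count(term.lower())
    if freq == 0:
        return 0.0
    tf = math.sqrt(freq)
    length_norm = 1.0 / math.sqrt(len(tokens))
    return tf * length_norm


def rank(docs, term):
    """Return document names ordered by descending approximate score."""
    return sorted(docs, key=lambda name: approx_score(docs[name], term),
                  reverse=True)


for name in rank(people, "David"):
    print(name, round(approx_score(people[name], "David"), 4))
```

Under those assumptions this orders them Person 2 (about 0.71),
Person 3 (about 0.63), Person 1 (about 0.45), which is the ordering
I'm seeing: Person 1's 10 tokens outweigh its two matches.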
Is there some ES mapping option to prevent this from happening? Or
alternatively, a query DSL directive to prevent it?
The alternative workaround is to index each alias (each value of the
field) as a separate document and somehow group the hits by person at
query time. What is a good way of doing this with ES?
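To illustrate what I mean by that workaround, here is a minimal
client-side sketch in plain Python (no ES calls). Everything in it is
hypothetical: the person_id/alias document shape, the alias_score and
group_by_person helpers, and the choice to sum the per-alias scores.
The per-alias score uses the same simplified 1/sqrt(token count) norm,
now computed per value rather than over the whole field.

```python
import math
from collections import defaultdict

# Hypothetical one-document-per-alias layout.
alias_docs = [
    {"person_id": 1, "alias": "David Bowie"},
    {"person_id": 1, "alias": "David Robert Jones"},
    {"person_id": 1, "alias": "Ziggy Stardust"},
    {"person_id": 1, "alias": "Thin White Duke"},
    {"person_id": 2, "alias": "David Letterman"},
    {"person_id": 3, "alias": "David Hasselhoff"},
    {"person_id": 3, "alias": "David Michael Hasselhoff"},
]


def alias_score(alias, term):
    """Simplified per-alias score: sqrt(freq) / sqrt(tokens in this alias)."""
    tokens = alias.lower().split()
    freq = tokens.count(term.lower())
    return math.sqrt(freq) / math.sqrt(len(tokens)) if freq else 0.0


def group_by_person(docs, term):
    """Sum matching alias scores per person; return (person_id, total) pairs."""
    totals = defaultdict(float)
    for doc in docs:
        score = alias_score(doc["alias"], term)
        if score > 0:
            totals[doc["person_id"]] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for person_id, total in group_by_person(alias_docs, "David"):
    print(person_id, round(total, 4))
```

With each alias normed on its own, Persons 1 and 3 tie at the top and
Person 2 comes last, which is the ordering I'd intuitively expect.
What I'm missing is a good way to do this grouping inside ES rather
than in the client.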
Many thanks,
Tal