Surprising scoring when using multi_match's cross_fields


(Christoph Lingg) #1

Hello!

I am using the multi_match query with type cross_fields
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html#type-cross-fields.
It works very well and is exactly what I need. However, in some rare
cases the order of the results doesn't match my expectations. It
turns out that the score of the first results is much higher than the
score of the rest of the documents. I had a closer look at the explain
output and was surprised by the way the scores were calculated:

for the first doc:
{
  "value": 8.252264,
  "description": "fieldWeight in 998806, product of:",
  "details": [
    {
      "value": 1,
      "description": "tf(freq=1.0), with freq of:",
      "details": [
        {
          "value": 1,
          "description": "termFreq=1.0"
        }
      ]
    },
    {
      "value": 8.252264,
      "description": "idf(docFreq=13182, maxDocs=18605118)"
    },
    {
      "value": 1,
      "description": "fieldNorm(doc=998806)"
    }
  ]
}

and for the doc that is supposed to be first:
{
  "value": 3.8485851,
  "description": "score(doc=700068,freq=1.0 = termFreq=1.0\n), product of:",
  "details": [
    {
      "value": 0.46578622,
      "description": "queryWeight, product of:",
      "details": [
        {
          "value": 18,
          "description": "boost"
        },
        {
          "value": 8.262557,
          "description": "idf(docFreq=13047, maxDocs=18605118)"
        },
        {
          "value": 0.0031318406,
          "description": "queryNorm"
        }
      ]
    },
    {
      "value": 8.262557,
      "description": "fieldWeight in 700068, product of:",
      "details": [
        {
          "value": 1,
          "description": "tf(freq=1.0), with freq of:",
          "details": [
            {
              "value": 1,
              "description": "termFreq=1.0"
            }
          ]
        },
        {
          "value": 8.262557,
          "description": "idf(docFreq=13047, maxDocs=18605118)"
        },
        {
          "value": 1,
          "description": "fieldNorm(doc=700068)"
        }
      ]
    }
  ]
}

You can see that the queryWeight factor is missing from the calculation for
the first doc, which leads to a much higher total score. I am no expert,
but this looks like a bug to me. Or did I misunderstand something?
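To make the difference concrete, the two explain trees can be recomputed by hand. This is only a sketch that assumes Lucene's classic TF/IDF similarity (idf = 1 + ln(maxDocs / (docFreq + 1)), tf = sqrt(termFreq)), which is what the numbers in the explain output appear to follow. The helper names are made up for illustration; the inputs are the values quoted above.

```python
import math

def idf(doc_freq, max_docs):
    # Classic TF/IDF inverse document frequency: 1 + ln(maxDocs / (docFreq + 1))
    return 1.0 + math.log(max_docs / (doc_freq + 1))

def field_weight(freq, doc_freq, max_docs, field_norm):
    # fieldWeight = tf * idf * fieldNorm, with tf = sqrt(termFreq)
    return math.sqrt(freq) * idf(doc_freq, max_docs) * field_norm

def query_weight(boost, doc_freq, max_docs, query_norm):
    # queryWeight = boost * idf * queryNorm
    return boost * idf(doc_freq, max_docs) * query_norm

MAX_DOCS = 18605118

# First doc: the explain output contains only the fieldWeight part.
first_doc_score = field_weight(1.0, 13182, MAX_DOCS, 1.0)

# Doc expected on top: queryWeight * fieldWeight.
expected_top_score = (
    query_weight(18, 13047, MAX_DOCS, 0.0031318406)
    * field_weight(1.0, 13047, MAX_DOCS, 1.0)
)

print(f"first doc:    {first_doc_score:.4f}")
print(f"expected top: {expected_top_score:.4f}")
```

Both results land close to the values in the explain output (8.252264 and 3.8485851). In the second tree the queryWeight of about 0.466 scales the fieldWeight down, while the first tree is not scaled at all.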

You can find the query and the result list here:

Thanks for your help!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3526746f-d8a9-4344-978a-9240bbd38a13%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Christoph Lingg) #2

Another effect I do not understand is the queryNorm, which differs between
documents. From reading the documentation I assumed it to be constant.
From the Lucene documentation
http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/search/Similarity.html:

queryNorm(q) is a normalizing factor used to make scores between queries
comparable. This factor does not affect document ranking (since all ranked
documents are multiplied by the same factor), but rather just attempts to
make scores from different queries (or even different indexes) comparable.

This is from the scoring of the first results:
{
  "value": 0.0059806756,
  "description": "queryNorm"
}

Others have:
{
  "value": 0.0031318406,
  "description": "queryNorm"
}

Since the queryNorm does end up affecting document ranking here, I am
asking myself whether I am doing something wrong ...



(Christoph Lingg) #3

Hm, after some investigation it turns out that the queryNorm is tied to the
shard. I observed that only one of the five shards has a different
queryNorm; all the others have equal ones. I will retry with only one shard
to see if things improve.
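For reference, here is a rough sketch of why the queryNorm can vary per shard. It assumes the classic TF/IDF similarity, where queryNorm = 1 / sqrt(sumOfSquaredWeights), and it simplifies the query to a flat list of term clauses, which is not exactly what cross_fields builds. The boosts and idf values are invented, not taken from the index.

```python
import math

def query_norm(clauses):
    # clauses: iterable of (boost, idf) pairs, one per term clause.
    # Classic TF/IDF: queryNorm = 1 / sqrt(sum((boost * idf)^2)).
    sum_of_squared_weights = sum((boost * idf) ** 2 for boost, idf in clauses)
    return 1.0 / math.sqrt(sum_of_squared_weights)

# Hypothetical idf values for the same two terms on two different shards.
# Each shard computes idf from its own term statistics.
shard_a_clauses = [(18.0, 8.25), (1.0, 5.10)]
shard_b_clauses = [(18.0, 9.40), (1.0, 6.00)]

norm_a = query_norm(shard_a_clauses)
norm_b = query_norm(shard_b_clauses)

print(f"shard A queryNorm: {norm_a:.7f}")
print(f"shard B queryNorm: {norm_b:.7f}")
```

Because idf feeds into the sum of squared weights, a shard with different term statistics ends up with a different queryNorm for the same query.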



(Stephane Bastian) #4

Christoph,

I'm wondering if the problem comes from the search_type parameter?


Have you tried dfs_query_then_fetch? Does it make any difference?
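As a sketch of what such a request could look like: the snippet below only builds the URL and body and does not send anything. The host, index name, field names, and query text are placeholders, not taken from this thread; search_type, explain, and the multi_match options are standard search parameters.

```python
import json
from urllib.parse import urlencode

def build_search_request(host, index, text, fields, search_type):
    # search_type is passed as a URL parameter on the _search endpoint.
    params = urlencode({"search_type": search_type})
    url = f"{host}/{index}/_search?{params}"
    body = {
        "explain": True,  # return the scoring explanation for each hit
        "query": {
            "multi_match": {
                "query": text,
                "type": "cross_fields",
                "fields": fields,
            }
        },
    }
    return url, body

# Placeholder values for illustration only.
url, body = build_search_request(
    "http://localhost:9200",
    "products",
    "red shoes",
    ["title^18", "brand"],
    "dfs_query_then_fetch",
)

print(url)
print(json.dumps(body, indent=2))
```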

All the best,
Stéphane



(Christoph Lingg) #5

Hi Stephane,

Yes, I did try it, but the results did not change. However, I reduced the
number of shards from 5 to 1 and now the queryNorm is the same for every
document. I learned that every shard is an independent Lucene index, so
different weights are likely to occur across shards.
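A small sketch of that effect, assuming the classic idf formula (1 + ln(maxDocs / (docFreq + 1))). The per-shard statistics are invented for illustration: each shard derives its own idf, while a single shard holding all documents has exactly one.

```python
import math

def idf(doc_freq, max_docs):
    # Classic TF/IDF inverse document frequency.
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# Hypothetical (doc_freq, max_docs) for one term on each of five shards.
shard_stats = [
    (1000, 1_000_000),
    (1100, 1_000_000),
    (950, 1_000_000),
    (1050, 1_000_000),
    (40, 1_000_000),  # the term is rare on this shard
]

# Five shards: each one scores with its own local statistics.
local_idfs = [idf(df, n) for df, n in shard_stats]

# One shard: the same documents share a single set of statistics.
merged_idf = idf(
    sum(df for df, _ in shard_stats),
    sum(n for _, n in shard_stats),
)

for i, value in enumerate(local_idfs):
    print(f"shard {i}: idf = {value:.4f}")
print(f"single shard: idf = {merged_idf:.4f}")
```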

However, the first strange behavior (the missing queryWeight factor) still
occurs from time to time, fortunately not too often.

Cheers,
Christoph


