Document score explanation values (maxDocs ?)


#1

Hi,

I'm querying some documents with two sort attributes (the default score and one of the document attributes). I'm getting different scores for these docs and I cannot figure out why. Using explain=true, I can see differing values, even though all the documents look the same. For example, I'm querying for "1090"; all documents have an id of the format "xxxxxxxxxx.1090.xx", an attribute "ContractNumber: 1090", and an attribute "ElasticsearchKey" that contains the document's id. Besides those 3 matches, I cannot see "1090" in any other attribute. So I was expecting identical scores, but there are some score variations that I cannot understand.

With the query (the original query is not this one, this is just a "reduced" form to debug/reproduce the behaviour):

{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "1090"
        }
      }
    }
  }
}

The results (6 documents) have different scores. On their "_explanation" attribute, we have different descriptions:

"description": "weight(_all:1090 in 26362) [PerFieldSimilarity], result of:"
"description": "weight(_all:1090 in 11206) [PerFieldSimilarity], result of:"
"description": "weight(_all:1090 in 67244) [PerFieldSimilarity], result of:"
[...]

What are those "in xxx" values? According to the details attribute, it seems to be a "MaxDocs" attribute, but how is that calculated? There are also other variations: some explanations have 2 "details" attributes, others have more. Here are the full explanation responses, if anyone cares to see them: http://pastebin.com/jLvzPQgE (I've removed the other document attributes for clarity).

Any hints ?

Thank you


(Ivan Brusic) #2

The problem you are experiencing is due to distributed search. The IDF
values are calculated per shard, so scores can change depending on which
shard the document is located on. If you notice, the documents with the
same score are all on the same shard.

This problem normally manifests when you have a low number of documents
spread over several shards. If you had millions of documents, the
differences would be much smaller.
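
To make the per-shard effect concrete, here is a minimal sketch. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(maxDocs / (docFreq + 1)), and the shard names and statistics are made-up numbers for illustration, not values taken from the explain output in this thread:

```python
import math

def classic_idf(doc_freq: int, max_docs: int) -> float:
    """IDF in the style of classic Lucene TF-IDF: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))

# Hypothetical per-shard statistics for the term "1090": (docFreq, maxDocs).
shard_stats = {
    "shard_0": (2, 30000),
    "shard_1": (1, 70000),
    "shard_2": (3, 15000),
}

# Each shard scores with its own local statistics, so the IDF differs per shard.
per_shard_idf = {shard: classic_idf(df, n) for shard, (df, n) in shard_stats.items()}

# With index-wide statistics, every shard would use the same IDF.
total_doc_freq = sum(df for df, _ in shard_stats.values())
total_max_docs = sum(n for _, n in shard_stats.values())
global_idf = classic_idf(total_doc_freq, total_max_docs)

for shard, idf in per_shard_idf.items():
    print(f"{shard}: idf={idf:.4f}")
print(f"global: idf={global_idf:.4f}")
```

Documents that are otherwise identical get a different IDF factor depending on which shard they live on, which is enough to produce different scores.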

One option is to use the dfs_query_then_fetch search type, which computes
term statistics across all shards before scoring:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-search-type.html#dfs-query-then-fetch

There is a slight performance hit, but it should help with the problem.
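
As a rough sketch of what that request could look like, the snippet below only builds the URL and body and does not send anything. The host "http://localhost:9200" and the index name "my_index" are placeholders, and build_search_request is a hypothetical helper, not part of any client library:

```python
import json
from urllib.parse import urlencode

def build_search_request(base_url: str, index: str, term: str):
    """Build the URL and JSON body for a search using dfs_query_then_fetch."""
    # search_type and explain are passed as URL parameters of the _search endpoint.
    params = urlencode({"search_type": "dfs_query_then_fetch", "explain": "true"})
    url = f"{base_url}/{index}/_search?{params}"
    # Same reduced query as in the original post.
    body = json.dumps({
        "query": {
            "bool": {
                "must": {
                    "query_string": {"query": term}
                }
            }
        }
    })
    return url, body

# Placeholder host and index name; replace them with your own.
url, body = build_search_request("http://localhost:9200", "my_index", "1090")
print(url)
print(body)
```

Comparing the explain output with and without the search_type parameter should show whether the score differences come from per-shard statistics.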

Cheers,

Ivan


#3

Ok, that makes sense. We'll evaluate the scenario & solutions.

Thanks.

