Maybe that hit comes from a different shard? Have you tried setting the search_type parameter to dfs_query_then_fetch? That computes the distributed docFreq first, so all shards work with the same docFreq when computing the score.
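For what it's worth, here is a rough sketch of what such a request could look like. It only builds the URL and JSON body and does not send anything. The host, the index name "my_index" and the field names "title" and "body" are placeholders I made up, not something from your setup.

```python
import json
from urllib.parse import urlencode


def build_search_request(host, index, terms, fields):
    """Build the URL and JSON body for a multi_match search that asks
    Elasticsearch to gather term statistics from all shards first."""
    # search_type is passed as a URL parameter on the _search endpoint.
    params = urlencode({"search_type": "dfs_query_then_fetch"})
    url = f"{host}/{index}/_search?{params}"
    body = {
        "query": {
            "multi_match": {
                "query": terms,    # both search terms in one string
                "fields": fields,  # fields to search in
            }
        }
    }
    return url, json.dumps(body)


# Hypothetical host, index and field names, for illustration only.
url, body = build_search_request(
    "http://localhost:9200", "my_index", "first second", ["title", "body"]
)
print(url)
print(body)
```

The extra round trip to collect the statistics makes the search a bit slower, so it is probably most useful on small indices like the one described here.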
The maxDocs only refers to documents in the same shard as the one the hit came from.
This has a significant impact on the final score in my test case, where I have only a few documents indexed in my cluster and am running a multi_match query with two search terms.
The most relevant document, where both search terms appear in the searched field, is only presented as hit number 3, even though all logic says it should be number 1. Hits number 1 and 2 only include the first search term.
This happens because the documents are indexed on 3 different shards: one of the search terms appears in only 1 document on each shard, while the number of documents on each shard varies from 9 to 31.
This of course gives a difference in the overall scoring.
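To put rough numbers on it, here is a small sketch of the effect. It assumes the classic Lucene TF-IDF similarity, where idf = 1 + ln(maxDocs / (docFreq + 1)); other similarities use a different formula. The shard sizes 9 and 31 are from my case, while the middle value of 20 is made up for illustration.

```python
import math


def classic_idf(doc_freq, max_docs):
    """idf as in Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# 9 and 31 are the smallest and largest shards; 20 is a hypothetical
# size for the third shard.
shard_sizes = [9, 20, 31]
doc_freq_per_shard = 1  # the rare term appears in one document per shard

# idf as each shard computes it from its own local statistics
per_shard = [classic_idf(doc_freq_per_shard, n) for n in shard_sizes]

# idf if docFreq and maxDocs were summed over the whole cluster
global_idf = classic_idf(doc_freq_per_shard * len(shard_sizes), sum(shard_sizes))

for n, idf in zip(shard_sizes, per_shard):
    print(f"shard with {n} docs: idf = {idf:.3f}")
print(f"cluster-wide: idf = {global_idf:.3f}")
```

The same term gets a clearly lower idf on the smallest shard than on the largest one, so a document's score depends on which shard it happens to be stored on.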
Shouldn't the maxDocs, in the optimal setup, be calculated at cluster level?
In development phases some might not have more than 100 documents in their cluster, so the lesson must be to get closer to the expected number of documents in the production environment (or at least "many" documents on each shard) before trying to tweak the search results.