To be a little less hand-wavy (please correct me if I'm wrong): some
stats used in the scoring, like IDF, are computed per shard, by
default. These stats are effectively computed only from the document
set present in that one shard. This means that the same document can
be scored differently, depending on which shard it ends up in.
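To make that concrete, here is a rough sketch of how per-shard stats can pull the IDF of the same term apart. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)); the shard names and counts are made up for illustration, and your similarity settings may well differ.

```python
import math


def idf(num_docs: int, doc_freq: int) -> float:
    """Classic Lucene-style IDF (assumed): 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-shard stats for one term:
# (documents in the shard, documents in the shard containing the term).
shards = {"shard_0": (1000, 10), "shard_1": (1000, 200)}

# Default behaviour: each shard only sees its own document set.
per_shard_idf = {name: idf(n, df) for name, (n, df) in shards.items()}

# Stats taken over all shards: sum the counts first, then compute IDF once.
total_docs = sum(n for n, _ in shards.values())
total_df = sum(df for _, df in shards.values())
global_idf = idf(total_docs, total_df)

for name, value in per_shard_idf.items():
    print(f"{name}: idf={value:.3f}")
print(f"global:  idf={global_idf:.3f}")
```

The term looks rare on shard_0 and common on shard_1, so an otherwise identical document would get a higher score on shard_0 than on shard_1.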
By changing the search type, you can change this behaviour so that the
stats are computed at the index level (not the shard level), i.e. from the
document set present in the entire index. This helps to score
consistently within one index.
AFAIK there is no way to run cross-index queries accurately. You can
rely on the "evening out" that Clinton mentions. In that case you need
to be careful that your routing doesn't skew the stats distribution too
much -- if each shard receives very different data, then the stats
will never even out. The default routing is fine, as it spreads
documents evenly across shards (using a hash of the id field).
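As a rough illustration of why hash-based routing spreads documents evenly, here is a small sketch. It is not Elasticsearch's actual routing code: MD5 is only a stand-in hash, and `shard_for`, the shard count and the id format are all made up for the example.

```python
import hashlib
from collections import Counter


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard as hash(id) modulo the shard count.

    MD5 is a stand-in here; the real hash function is an implementation
    detail of Elasticsearch.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


NUM_SHARDS = 5  # hypothetical shard count

# Route 10,000 made-up document ids and count how many land on each shard.
counts = Counter(shard_for(f"doc-{i}", NUM_SHARDS) for i in range(10_000))
for shard in range(NUM_SHARDS):
    print(f"shard {shard}: {counts[shard]} docs")
```

The same id always maps to the same shard, and the counts come out close to equal, which is what lets the per-shard stats even out over time.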
> By changing the search type, you can change this behaviour so that the
> stats are computed at the index level (not the shard level), i.e. from the
> document set present in the entire index. This helps to score
> consistently within one index.
Not just at the index level, but for all the shards involved in your
query. So if you're doing a multi-index search and you use
search_type=dfs_query_then_fetch, it will first fetch the term and
document frequencies from all shards (from all indices in your query)
and only then execute the query.
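For reference, a minimal sketch of what such a multi-index request could look like. It only builds the URL and body and does not send anything; the host, the index names and the `build_search_request` helper are hypothetical, and the query is just a placeholder.

```python
import json
from urllib.parse import urlencode


def build_search_request(host, indices, query, search_type="dfs_query_then_fetch"):
    """Build the URL and JSON body for a search across several indices.

    Hypothetical helper: it only assembles the request, it does not send it.
    """
    path = ",".join(indices) + "/_search"
    url = f"{host}/{path}?{urlencode({'search_type': search_type})}"
    body = json.dumps({"query": query})
    return url, body


# Made-up host and index names, for illustration only.
url, body = build_search_request(
    "http://localhost:9200",
    ["index_a", "index_b"],
    {"match": {"title": "lucene"}},
)
print(url)
print(body)
```

The extra round trip to collect the stats costs some latency, so it is probably only worth it when the per-shard differences actually hurt your ranking.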
> AFAIK there is no way to run cross-index queries accurately. You can
> rely on the "evening out" that Clinton mentions. In that case you need
> to be careful that your routing doesn't skew the stats distribution too
> much -- if each shard receives very different data, then the stats
> will never even out. The default routing is fine, as it spreads
> documents evenly across shards (using a hash of the id field).
Sure, but for typical use cases you'll be routing on (e.g.) a client and
searching within just that client, so terms will be evenly distributed
for that client.