I'm interested in getting accurate scores for queries that span multiple indices with distinct document types. I understand today I can use dfs_query_then_fetch to ensure the document frequencies are relevant to the whole corpus of documents, not just each shard.
How is this affected with the switch to BM25? Would dfs_query_then_fetch solve the same problem on queries across indices scored by BM25? Are there other terms of the calculation that need pre-querying in BM25 like document frequency needs in TF-IDF?
Even with dfs_query_then_fetch, the values are still only calculated per
index, so it will not solve your problem for a multi-index search.
In the single-index case, I think dfs_query_then_fetch is still beneficial.
BM25 will saturate the TF values sooner, but without dfs_query_then_fetch
the values would still be calculated per shard. Usually a large enough
index is all it takes for the per-shard values to be representative.
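
To make the per-shard point concrete, here is a minimal sketch of how the IDF part of a BM25 score can differ between shards when each shard only sees its own statistics. The shard sizes and document frequencies are made-up numbers for illustration, and bm25_idf is a hypothetical helper written out from the IDF formula as I understand Lucene's BM25 to use it, not a call into Lucene or Elasticsearch.

```python
import math

def bm25_idf(doc_count, doc_freq):
    # IDF term as commonly given for Lucene's BM25 (assumed here):
    # log(1 + (N - n + 0.5) / (n + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical (doc_count, doc_freq) pairs for one term on two shards.
shards = [(1000, 5), (1000, 200)]

# Without a DFS phase, each shard scores with its own local statistics.
per_shard_idf = [bm25_idf(n, df) for n, df in shards]

# With a DFS phase, the statistics are summed before scoring.
total_docs = sum(n for n, _ in shards)
total_df = sum(df for _, df in shards)
global_idf = bm25_idf(total_docs, total_df)

print(per_shard_idf, global_idf)
```

The same term gets a much higher weight on the shard where it happens to be rare. With large, evenly distributed shards the local numbers drift toward the combined one, which is why the difference often matters less in practice.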
The Lucene BM25 parameters deal with term frequencies, not document
frequencies.
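
As a rough illustration of what those parameters do, the sketch below writes out the term-frequency part of the standard BM25 formula. The function name bm25_tf_component and the sample field lengths are assumptions for the example; k1=1.2 and b=0.75 are the usual Lucene defaults. Note that the formula also uses the average field length, which is a collection-level statistic just like document frequency.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # k1 controls how quickly repeated occurrences of a term saturate.
    # b controls how strongly the field length is normalized against
    # the average field length.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1) / (tf + k1 * length_norm)

# Hypothetical document of average length: the contribution grows
# with tf but flattens out and never reaches k1 + 1.
contributions = [bm25_tf_component(tf, 100, 100) for tf in (1, 2, 5, 10, 100)]
print(contributions)
```

Each additional occurrence of the term adds less than the previous one, which is the saturation mentioned above. Neither k1 nor b touches the document frequency, so they do not change which statistics need to be gathered across shards.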