I hope this statement is still valid , and i confirmed this behavior too ..
What was the reasoning behind this to make the statistics per shard ( not per index) which defeat the purpose of distributed systems as when it supposed to be calculating the relevance in global by looking across all shards
In practise(considering realtime scenarios and distributed nature of elastic earch), the documents would spread across the nodes - i guess it is the whole purpose of distributed systems like ES.
However, the following statement might defeat the whole purpose of distributed nature
you would need to ship all the documents to a single node to relatively score them.
As Elasticsearch is a distributed system, you cannot guarantee that all shards of a given index will be on the same node when a query is processed. If you wanted to do it on an index level, you would need to ship all data from the index to a single node to calculate the relevance.
So you either have a single monolithic system to have index level scoring, or a distributed one with shard level scoring. There's costs and benefits to both.
ok, you mean it is more of operational complexity to do global relevance scoring across all shards. So - No other reason , not to do global relevance scoring
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.