I am wondering what the advantage is of having primary shards spread among several nodes. Since you can use multithreading, isn't it more efficient to handle all the details of the same search query on the same node?
What would be the consequences of designing a policy that concentrates all the primary shards on the same node? Bear in mind that each index will go to a different node and you will keep your replicas on another node.
I'm not sure that will work, because in this use case the query has to be sent to all the shards to get an accurate result. I was looking more for a sharding policy that puts all the primary shards on the same node when possible. Maybe it's best not to use shards at all.
It will query all shards even if some of them are not available locally. It will just have a preference for the local ones.
But I'm wondering whether this is just a question you have or a real problem?
I mean that you should not really try to change the default behavior unless you have a real problem. Do you?
I think you are perhaps confused about the role of primary shards. From the point of view of a search they're just another shard copy, no different from any other replica. The main difference between primaries and replicas is that primaries perform a small amount of extra coordination when indexing a document. Using preference=_local will try to keep a search on the local node regardless of whether the shards on the local node are primaries or replicas.
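To make the local-preference search concrete, here is a minimal sketch of how such a request could be assembled. It assumes the search API's `preference` query parameter with the value `_local`; the host `http://localhost:9200`, the index name `my-index`, the field `message`, and the helper `build_local_search` are all hypothetical and only there for illustration. It builds the URL and body without sending anything.

```python
import json
from urllib.parse import urlencode


def build_local_search(base_url, index, query_text, field="message"):
    """Return (url, body) for a search that prefers shard copies on the local node.

    Hypothetical helper: base_url, index and field are placeholders.
    """
    # preference=_local asks the coordinating node to favour its own shard copies;
    # shards with no local copy are still queried on other nodes.
    params = urlencode({"preference": "_local"})
    url = f"{base_url.rstrip('/')}/{index}/_search?{params}"
    body = json.dumps({"query": {"match": {field: query_text}}})
    return url, body


url, body = build_local_search("http://localhost:9200", "my-index", "error")
print(url)  # http://localhost:9200/my-index/_search?preference=_local
```

The returned URL and JSON body can be handed to whatever HTTP client you already use. Nothing in the request mentions primaries or replicas, which matches the point that a search treats every shard copy the same way.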
An algorithm that keeps the N shards of an index on the same node works for us, whether they are all primaries, all replicas, or a mix.
I guess that for recovery purposes I would keep all the primaries together on one node and their replicas together on another, so the probability of losing data is lower.
So from that point of view, my interest has nothing to do with primaries or replicas, just with performing the search with multiple threads on the same node instead of across several nodes. But anyway, maybe Elastic is designed so that the latencies are negligible in this situation.
The primary/replica distinction doesn't make any difference in terms of data durability either.
If CPU were the limiting factor then it probably would be better to stay on a single node using lots of threads, but there are other things like I/O bandwidth that don't scale with the number of processors, and this is, put simply, why it can be better to use multiple nodes. Network latency isn't normally a major concern, but if it is then you can use shard allocation awareness to try to keep searches within a single zone (e.g. node or rack) where the latency is best.
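As a rough sketch of the shard allocation awareness setup mentioned above: each node is tagged with a custom attribute in its configuration (for example `node.attr.rack_id`), and the cluster is then told which attribute to be aware of via `cluster.routing.allocation.awareness.attributes`. The attribute name `rack_id` and the helper `awareness_settings` are assumptions for illustration, and whether searches actually prefer same-zone copies varies between Elasticsearch versions, so check the documentation for the version you run.

```python
import json


def awareness_settings(attribute="rack_id"):
    """Build a body for PUT _cluster/settings enabling allocation awareness.

    Hypothetical helper: 'rack_id' is a placeholder attribute name and must
    match the node.attr.<name> value configured on each node.
    """
    return {
        "persistent": {
            "cluster.routing.allocation.awareness.attributes": attribute
        }
    }


print(json.dumps(awareness_settings(), indent=2))
```

The resulting JSON would be sent as the body of a cluster settings update. With awareness enabled, copies of the same shard are spread across different attribute values where possible.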