I am curious to know what, apart from shard count, Elasticsearch looks at when allocating shards to a node. I am mainly interested in whether it tries to maximize the diversity of shards available on any randomly chosen set of nodes.
In my context, we have an index with 7 primary shards and 3 replicas deployed on an 11-node cluster. The discussion of why I chose these numbers is for another day. So there are 28 shard copies in all (7 shards × 4 copies each). Given that 28/11 ≈ 2.55, I expect every node to contain either 2 or 3 shards. This is true in my case.
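The arithmetic above can be checked with a small sketch. The function name `copies_per_node_bounds` is made up for illustration and is not part of Elasticsearch. It only computes what a perfectly even spread by shard count would look like.

```python
import math

def copies_per_node_bounds(primaries, replicas, nodes):
    """Return (total shard copies, min per node, max per node) for an even spread."""
    total = primaries * (1 + replicas)  # each primary plus its replicas
    return total, total // nodes, math.ceil(total / nodes)

total, low, high = copies_per_node_bounds(primaries=7, replicas=3, nodes=11)
print(total, low, high)  # 28 2 3
```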
Now let's talk about the diversity of shards available on a chosen number of nodes. If I look at 5 nodes, I expect the shard count to be between 10 and 15. The actual count is 12, which is consistent with the calculation so far. Since 12/7 ≈ 1.71, I would expect 1 or 2 copies of every shard on these 5 nodes (2 copies each for 5 shards and 1 copy each for 2 shards) if ES also maintained diversity. This is not what I find to be the case. I see 3 copies for 1 shard, 2 copies for 3 shards and 1 copy for 2 shards. Any idea whether the homogeneity in distribution I am looking for is a justified expectation or not? Is there any related documentation?
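One way to make the "diversity" check concrete is to count the copies of each shard on a chosen set of nodes and see whether the counts differ by at most one. This is a minimal sketch: the node names and the placement are made up for illustration (they are not the real cluster's allocation), and the helper names are hypothetical.

```python
from collections import Counter

def copies_on_nodes(placement, chosen_nodes):
    """Count how many copies of each shard live on the chosen nodes.

    placement maps a node name to the list of shard numbers it holds.
    """
    counts = Counter()
    for node in chosen_nodes:
        counts.update(placement.get(node, []))
    return counts

def is_evenly_diverse(counts, num_shards):
    """True if the per-shard copy counts differ by at most one."""
    per_shard = [counts.get(shard, 0) for shard in range(num_shards)]
    return max(per_shard) - min(per_shard) <= 1

# Hypothetical placement: 5 nodes holding 12 copies of 7 shards (numbered 0-6).
placement = {
    "node-a": [0, 1, 2],
    "node-b": [0, 3],
    "node-c": [0, 4, 5],
    "node-d": [1, 2],
    "node-e": [3, 6],
}

counts = copies_on_nodes(placement, placement.keys())
print(sorted(counts.items()))
print(is_evenly_diverse(counts, num_shards=7))
```

In this made-up placement every node holds 2 or 3 shards, yet shard 0 has three copies on the subset while others have one, so a balanced shard count per node does not by itself imply an even spread of copies per shard.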
I'm no expert on Elasticsearch shard allocation but I have made some observations over the years which may answer your question.
What I've noticed from versions 5.x up to 6.2.2 (which is my current version) is that when allocating shards, Elasticsearch treats primary and replica shards the same. And to a certain degree they are the same: the main difference is that a write operation goes to the primary shard first, and only when that has succeeded is it forwarded to the replica(s).
When balancing the cluster, Elasticsearch will try to spread out the shards evenly, as you noticed, and it will do so without treating primary and replica shards differently. The reason, I surmise, is simply this: a replica can be promoted to a primary in an instant if the node holding the primary falls out of the cluster. If Elasticsearch treated primary and replica shards differently, it would have much more work to do when a node falls out. Instead, it just has to promote replicas on other nodes to primaries, without moving things around to even out the distribution of primaries.
In my clusters, for instance after doing a rolling upgrade, I can often find one node with only replica shards while another may have only primary shards. But this has never been a problem, since a replica is just as big as its primary. As long as each node gets the same number of shards, the cluster will be well balanced, with roughly the same amount of data (number of documents) on all nodes.
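To see how primaries and replicas end up spread across nodes, one option is to tally the output of the cat shards API, for example `GET _cat/shards?h=index,shard,prirep,state,node`. The sketch below only parses text in that column order. The sample rows, index name and node names are made up, and it assumes node names contain no spaces.

```python
from collections import defaultdict

def primaries_and_replicas_per_node(cat_shards_text):
    """Tally primaries ("p") and replicas ("r") per node.

    Expects whitespace-separated rows in the order:
    index shard prirep state node
    """
    tally = defaultdict(lambda: {"p": 0, "r": 0})
    for line in cat_shards_text.strip().splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # e.g. an unassigned shard, which has no node column
        _index, _shard, prirep, _state, node = fields[:5]
        tally[node][prirep] += 1
    return dict(tally)

# Made-up sample output, for illustration only.
sample = """
myindex 0 p STARTED node-1
myindex 0 r STARTED node-2
myindex 1 r STARTED node-1
myindex 1 p STARTED node-3
myindex 2 r UNASSIGNED
"""

for node, counts in sorted(primaries_and_replicas_per_node(sample).items()):
    print(node, counts)
```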