I have a cluster with 11 nodes and an index with 16 shards. Each shard has one replica. The allocation of the shards is as follows. I mark a replica with a ', so for example 4 is the primary of shard 4 and 4' is its replica.
We can see that nodes 0, 2 and 4 are not used, and node 8 has 3 shards to query. I assume shards on the same node are queried in sequence, so node 8 is the critical path. There actually exists a scheduling that gives each node at most 2 shards:
Also, the ES profile API shows that on node 8 the shards 0, 2' and 8 have the top 3 profiled query times. Does ES also try to schedule expensive shard queries to different nodes? In my second schedule, we can see that 0, 2' and 8 could be placed on different nodes.
Does ES have any other considerations that prevent it from scheduling the other way?
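For reference, this is roughly how I check which node served each shard copy: I run a profiled search and parse the shard ids in the response. A minimal Python sketch using the `requests` library; the host and index name are placeholders for my setup:

```python
import requests

ES = "http://localhost:9200"   # assumed local cluster endpoint
INDEX = "my_index"             # placeholder index name

# Run a search with profiling enabled; the response lists every shard copy
# that took part, identified as "[nodeId][indexName][shardNumber]".
resp = requests.post(
    f"{ES}/{INDEX}/_search",
    json={"profile": True, "query": {"match_all": {}}},
).json()

for shard in resp["profile"]["shards"]:
    node_id, index_name, shard_num = shard["id"].strip("[]").split("][")
    print(f"shard {shard_num} of {index_name} was searched on node {node_id}")
```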
When executing a search, it will be broadcast to all the index/indices shards (round robin between replicas).
Does 'replicas' here include both the primary and the replica copies? My cluster has only one replica for each primary. I am not sure if that is why it always picks the same copies.
For example, it always chooses copies 0, 2' and 8 on node 8 no matter how many times I run the profile API. Is this a limitation of profiling?
Yes, it alternates between primary and replicas. Adaptive replica selection seems to have been added in Elasticsearch 6.1. One other way to affect which replicas are selected is through the use of preference, although as far as I know there is no option for even distribution. If you have more than one query running in parallel, it is possible that the overall load will even out.
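For illustration, searches that use the same custom preference string are routed to the same shard copies, while different strings may land on different copies. A minimal sketch in Python with `requests`, assuming a local test cluster and a placeholder index name, that profiles two searches with different preference strings and prints which nodes served them:

```python
import requests

ES = "http://localhost:9200"   # assumed local cluster endpoint
INDEX = "my_index"             # placeholder index name

# Searches sharing a custom preference string hit the same shard copies;
# different strings may pick different copies.
for pref in ("session-1", "session-2"):
    resp = requests.post(
        f"{ES}/{INDEX}/_search",
        params={"preference": pref},
        json={"profile": True, "query": {"match_all": {}}},
    ).json()
    nodes = sorted({s["id"].split("][")[0].lstrip("[") for s in resp["profile"]["shards"]})
    print(pref, "->", nodes)
```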
If it does round robin without randomization, then once it starts with one scheduling, after the round robin the other one is the only choice. I guess if the total number of copies (primary plus replicas) per shard is N, there are always only N possible schedulings.
Does this mean how shards are allocated determines whether shard querying can be evenly distributed? If the total number of copies divides evenly across my number of nodes, it could be better. For example, if I have 8 nodes, could the shards be allocated as follows?
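To illustrate the "only N schedulings" point above, here is a toy simulation (my own simplification, not the actual ES routing code, and the copy placement is made up): with strict per-shard round robin between two copies and no randomization, the whole search only ever alternates between two assignments:

```python
from collections import Counter

# Toy model: 16 shards, 2 copies each; placement is arbitrary for the demo.
copies = {shard: (f"node{shard % 11}", f"node{(shard + 5) % 11}") for shard in range(16)}

for search_no in range(4):
    # Without randomization every shard flips to its other copy in lock step,
    # so with 2 copies per shard there are only 2 distinct whole-search assignments.
    chosen = [nodes[search_no % 2] for nodes in copies.values()]
    print(f"search {search_no}: shards per node = {dict(sorted(Counter(chosen).items()))}")
```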
It's a bit tricky to look at this on a search-by-search basis because the logic for routing searches to shards is quite involved and depends on various settings and bits of internal state. Roughly speaking the coordinating node decides which shards to search and then for each shard it decides which copy to search, in isolation from all the other shards. It does not try and perform each individual search completely evenly, because to do so would be expensive to compute and is not necessary: on a large enough sample from a busy enough cluster, searches will be distributed evenly. In a quiet cluster this may not be true, but it doesn't need to be true if the cluster is quiet.
Adaptive replica selection makes this more complicated, but aims to spread the load evenly (in terms of load, not number of queries). On a quiet cluster it may not distribute searches evenly but, again, it doesn't need to on a quiet cluster.
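As a rough illustration of that point, here is a toy simulation (not the real routing logic): choosing a copy for each shard independently at random looks uneven after a handful of searches but evens out over a large sample:

```python
import random
from collections import Counter

# Toy model (not Elasticsearch code): 16 shards, 1 replica each, spread over
# 8 nodes so every node holds exactly 4 shard copies. Each search picks the
# copy of every shard independently and uniformly at random.
copies = {s: (f"node{s % 8}", f"node{(s + 4) % 8}") for s in range(16)}

def load_per_node(num_searches):
    load = Counter()
    for _ in range(num_searches):
        for nodes in copies.values():
            load[random.choice(nodes)] += 1
    return sorted(load.values())

print("after 5 searches:    ", load_per_node(5))       # typically quite uneven
print("after 10000 searches:", load_per_node(10_000))  # close to 20000 per node
```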
Search preference also certainly affects the routing of searches - that's its job - and another thing that's not been mentioned is shard allocation awareness which, if enabled, prefers to route searches to shards in the same awareness zone.
If you would like to know more details about what's going on under the hood, you could look at the OperationRouting class, particularly its preferenceActiveShardIterator method.
For what it's worth I tried your setup (16 shards, 1 replica, split across 11 nodes) and in an otherwise unused cluster I also saw the two-profile result you reported, although in my experiments the two profiles were completely complementary so the overall load was even.
Thank you, DavidTurner and Christian_Dahlqvist, for your comments.
Based on your inputs, I feel that if I can split each shard into two smaller shards, then a query can run over the shards with more parallelism. I can imagine that it does not make sense to have too many small shards (https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster). My index size is only 150G with 16 shards. Maybe that size is fine for now?
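If I do go down that route, my understanding is that the `_split` API can raise the shard count of an existing index. A rough sketch of the steps in Python with `requests`; the index names are placeholders, and I would check the docs for prerequisites before trying this on a real index:

```python
import requests

ES = "http://localhost:9200"            # assumed local cluster endpoint
SRC, DST = "my_index", "my_index_x2"    # placeholder index names

# 1. Block writes on the source index (a prerequisite for _split).
requests.put(f"{ES}/{SRC}/_settings",
             json={"settings": {"index.blocks.write": True}})

# 2. Split 16 shards into 32; the target count must be a multiple of the source count.
requests.post(f"{ES}/{SRC}/_split/{DST}",
              json={"settings": {"index.number_of_shards": 32}})
```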
How many concurrent queries are you expecting to serve? Optimising for a single query may not make sense if this is not what you expect to see later on.