How does Elasticsearch schedule shards for a query?

I have a cluster with 11 nodes and an index with 16 shards. Each shard has one replica. The allocation of the shards is shown below. I mark a replica with a ' suffix: for example, 4 is the primary copy of shard 4, and 4' is its replica.

 node    shards
 0       5'   11'  12'
 1       5    13   14'
 2       1'   4'
 3       6    7    8'
 4       10'  14   15'
 5       3'   6'   11
 6       2    9'   12
 7       0'   1    3
 8       0    2'   8
 9       4    7'   13'
 10      9    10   15

For a query, the profile API shows a query plan like the following:

 node    shards
 1       5    14'
 3       6    7
 5       11
 6       9'   12
 7       1    3
 8       0    2'   8
 9       4    13'
 10      10   15

We can see that nodes 0, 2 and 4 are not used, and node 8 has 3 shards to query. I assume that shards on the same node are queried in sequence, so node 8 is the critical path. Actually, there exists a schedule that gives each node at most 2 shards:

 node    shards
 1       5    14'
 2       1'   4'
 3       7
 4       15'
 5       6    11
 6       9'   12
 7       0'   3
 8       2'   8
 9       4    13'
 10      10

Why doesn't ES choose the second one?

Also, in the schedule the ES profile API shows, the shards 0, 2' and 8 on node 8 have the top 3 profiled query times. Does ES also try to schedule expensive shard queries to different nodes? In my second schedule, 0, 2' and 8 are spread across different nodes.

Does ES have any other considerations that prevent it from scheduling the other way?

Thank you,

No, shards are queried in parallel to the extent the size of the thread pool allows.
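If you want to see the size, queue and activity of the search thread pool on each node, a standard cat API call like the following shows it (just a sketch; the column list is optional):

GET /_cat/thread_pool/search?v&h=node_name,name,size,active,queue,rejected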

I believe Elasticsearch by default picks shards in a round-robin fashion. You can alter this, e.g. by enabling adaptive replica selection.
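On a version that supports it, adaptive replica selection can be enabled dynamically with something roughly like this (a minimal sketch):

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.use_adaptive_replica_selection": true
  }
}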

Is use_adaptive_replica_selection only available in 6.x?

My 5.6 cluster does not have it.

{
  "error": {
    "root_cause": [
      {
        "type": "remote_transport_exception",
        "reason": "[elastic-master2][10.0.80.36:9300][cluster:admin/settings/update]"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "transient setting [cluster.routing.use_adaptive_replica_selection], not dynamically updateable"
  },
  "status": 400
}

About round robin, the documentation says:

When executing a search, it will be broadcast to all the index/indices shards (round robin between replicas).

Does 'replicas' here include both the primary and the replica copies? My cluster has only one replica for each primary. I am not sure if this is why it always picks the same copies.

For example, it always chooses copies 0, 2' and 8 from node 8, no matter how many times I run the profile API. Is this a limitation of profile?

Yes, it alternates between primary and replicas. Adaptive replica selection seems to have been added in Elasticsearch 6.1. One other way to affect which replicas are selected is through the use of preference, although as far as I know there is no option for an even distribution. If you have more than one query running in parallel, it is possible that the overall load will even out.
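As a sketch of the preference mechanism (the index name and preference string below are just placeholders): searches that carry the same custom preference string are routed to the same set of shard copies, for example:

GET /my-index/_search?preference=my-custom-string
{
  "query": { "match_all": {} }
}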


My last comment about "always chooses copies 0, 2', and 8 from node 8" is incorrect. It actually switches between two schedules.

 node    shards
 0       5'   12'
 1       13
 2       1'   4'
 3       8'
 4       14
 5       3'   6'   11
 7       0'
 8       2'
 9       7'
 10      9    10   15

and

 node    shards
 1       5    14'
 3       6    7
 5       11
 6       9'   12
 7       1    3
 8       0    2'   8
 9       4    13'
 10      10   15

I tried a few more profiles. It never picks anything else.

If it does round robin without randomization, then once it starts with one schedule, the round robin means the other schedule is the only remaining choice. I guess if the total number of copies of each shard (primary plus replicas) is N, there are always only N possible schedules.

Does this mean that how shards are allocated determines whether shard querying can be distributed evenly? If the total number of shard copies is divisible by my number of nodes, it could be better. For example, what if I have 8 nodes and the shards are allocated as follows?

node   shards
0          0 1 2' 3'
1          4 5 6' 7'
2          8 9 10' 11'
3          12 13 14' 15'
4          0' 1' 2 3
5          4' 5' 6 7
6          8' 9' 10 11
7          12' 13' 14 15

But I am not sure whether we can control the allocation and the selection like that.
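One index-level setting I came across that might constrain the allocation side (this is just my reading of it, and my-index is a placeholder) is index.routing.allocation.total_shards_per_node, which caps how many shards of the index each node can hold:

PUT /my-index/_settings
{
  "index.routing.allocation.total_shards_per_node": 4
}

A cap that is too low can leave shards unassigned, so it has to leave some headroom above the average number of copies per node.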

It's a bit tricky to look at this on a search-by-search basis because the logic for routing searches to shards is quite involved and depends on various settings and bits of internal state. Roughly speaking the coordinating node decides which shards to search and then for each shard it decides which copy to search, in isolation from all the other shards. It does not try and perform each individual search completely evenly, because to do so would be expensive to compute and is not necessary: on a large enough sample from a busy enough cluster, searches will be distributed evenly. In a quiet cluster this may not be true, but it doesn't need to be true if the cluster is quiet.

Adaptive replica selection makes this more complicated, but aims to spread the load evenly (in terms of load, not number of queries). On a quiet cluster it may not distribute searches evenly but, again, it doesn't need to on a quiet cluster.

Search preference also certainly affects the routing of searches - that's its job - and another thing that's not been mentioned is shard allocation awareness which, if enabled, prefers to route searches to shards in the same awareness zone.
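As an illustration of the awareness mechanism (the attribute name zone and its value are made-up examples), each node gets an attribute in elasticsearch.yml and the cluster is then told to use it:

# elasticsearch.yml on each node
node.attr.zone: zone-a

# dynamic cluster setting, applied once
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}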

If you would like to know more details about what's going on under the hood, you could look at the OperationRouting class, particularly its preferenceActiveShardIterator method.

For what it's worth I tried your setup (16 shards, 1 replica, split across 11 nodes) and in an otherwise unused cluster I also saw the two-profile result you reported, although in my experiments the two profiles were completely complementary so the overall load was even.


Thank you, DavidTurner and Christian_Dahlqvist, for your comments.

Based on your input, I feel that if I could split each shard into two smaller shards, a query could run over the shards in a more parallel way. I can imagine that it does not make sense to have too many small shards (https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster). My index is only 150G with 16 shards. Maybe that size is fine for now?
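If I ever do decide to go to more, smaller shards, my understanding is that I would create a new index with a higher shard count and reindex into it, roughly like this (the index names and shard count are placeholders):

PUT /my-index-v2
{
  "settings": {
    "index.number_of_shards": 32,
    "index.number_of_replicas": 1
  }
}

POST /_reindex
{
  "source": { "index": "my-index" },
  "dest": { "index": "my-index-v2" }
}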

How many concurrent queries are you expecting to serve? Optimising for a single query may not make sense if this is not what you expect to see later on.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.