Adaptive Replica Selection - deeper details

Hello,

If possible, I'd like to learn more specifically how shard ranking is calculated for a search.

The current documentation for Adaptive Replica Selection says

By default, Elasticsearch uses adaptive replica selection to route search requests. This method selects an eligible node using shard allocation awareness and the following criteria:

  • Response time of prior requests between the coordinating node and the eligible node
  • How long the eligible node took to run previous searches
  • Queue size of the eligible node’s search threadpool

Do the first and second items have some overlap? Or is the first item the communication latency of the inter-node request and the second item a metric provided by the data node for the duration of search? Or are they something else that could be more precisely expressed? (Also, as stated, the first item seems to include latency of indexing requests, too. Is that the case?)

Are the first two items scoped down at all? That is, does the coordinating node have those criteria on a shard-by-shard or index-by-index basis, or is it for all requests sent to the candidate nodes?

If a coordinating node has determined that there are 8 participating shards for a given search, as it iterates through the shards determining which shard copy of each should be searched, does the selection for the first 7 influence the choice for the 8th (by, say, incrementing the apparent queue size for the destination nodes)? Or maybe it doesn't have to because those searches have been dispatched already and the search queue size will already reflect the new state?

Thanks in advance for any insight you can provide.

Hi Glen,
I think this blog post pretty much answers your questions :
[Improving Response Latency in Elasticsearch with Adaptive Replica Selection | Elastic Blog].
I don't think there is a shard-level ranking (which primary or replica of a specific shard is the "best performing" among my cluster) but rather a node-level ranking, based on statistics of previously dispatched searches. And AFAIU only search requests are taken into account (not indexing requests, which are not necessarily on par with search requests' performance).
When I want to troubleshoot unbalanced search load on the cluster, I use GET _nodes/stats/adaptive_selection which returns the metrics and node ranking for ARS.

Beautiful! Thanks for the details!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.