I've been digging into Elastic's mechanism of shard distribution and I'd like to validate my understanding to see if I got something wrong or if there is something missing.
With the aim of achieving a higher degree of parallelism inside a cluster, while still avoiding resource competition among the shards, would the statement below be correct?
In a cluster with 3 indexes, each index has 5 primary shards, the minimum number of cores that this cluster should have is 3 * 5 = 15 cores, regardless if that would lead to an architecture where the cluster would have a single node with 15 cores or 15 nodes with 1 core on each node.
I see, thanks. And in this case the number of replica and data shards wouldn't impact the ideal number of cores, correct? When deciding the number of cores in a cluster, the only type of shards that should be taken into account is really the primary ones, right?
The ideal number of shards depends a lot on the workload and latency requirements, and the number of primary and replica shards both play a part here. It also depends on the number of concurrent queries you are optimising for.
I'm less concerned about the ideal number of shards and more about the relationship between the total number of shards in my cluster and the total number of cores that the cluster should have, in order to accommodate every shard (optimized for parallelism) without burning money.
Let's say that for a fact my one-indexed cluster needs exactly 15 primary shards. Does that mean that I need exactly 15 cores to achieve the sweet spot, or in this calculation I should also taken into account other factors such as, for instance, the number replica shards?
There is no direct link between shards and cores. You can run a node with 15 shards on a VM with a single core, although that may not be optimal for performance. When you run a query against a number of shards, each shard is processed in a single thread, although different shards naturally are being queried in parallel. The more cores you have, the more shards can be processed in parallel. If you have more cores than shards on a node, some may be idle unless you have more than one query running concurrently.
Exactly what was I looking for. If I understood correctly then:
A cluster with more cores than shards wouldn't necessarily be a problem per se, since the load could be such that no cores would be idle, and in that scenario two or more cores (each with its own single thread) could be reading the same shard, in the same node at same moment, but for different queries.
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.