Hey there! I have one particularly high-throughput logging cluster that processes about 1 TB of data per day. Most of that data goes into a single index with 25 shards (which intentionally matches the number of nodes in the cluster).
I was looking at the cluster today and noticed that the primary shards aren't actually distributed evenly across the nodes: some nodes have 2 primary shards, and some have 0. Wouldn't the best configuration for maximum indexing speed be one primary shard per node? There are almost no other relevant indices on that cluster.
Is there a setting that tells Elasticsearch not to allow two primary shards from the same index to live on the same node?
Nodes with primaries handle the indexing request, and they follow up with other nodes to ensure the replication completes. So, in a way, they take on more of the network resource and communication burden. No?
Technically yes, but practically no. The communication burden is often negligible. Only in very rare cases would primary balance bring a little more performance. Establishing and keeping primary balance comes at a cost as well, though, since the balancer has to do more shard shuffling when a node fails.
Wait, so as long as the TOTAL of primaries and replicas for an index is distributed evenly across all my nodes, I can consider my cluster to be essentially balanced? Is that what you are saying?
Also note that neither of the balancing properties I've just mentioned distinguishes between primary and replica shards; they're treated as equal w.r.t. balancing.
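If you want to check both views of balance yourself (primaries per node versus total shards per node), here is a rough Python sketch. It assumes you have saved the plain-text output of `GET _cat/shards/<your-index>?h=index,shard,prirep,node` and that node names contain no spaces. The function name `shards_per_node` and the sample index and node names are made up for illustration.

```python
from collections import Counter


def shards_per_node(cat_shards_text):
    """Tally primaries and total shards per node from `_cat/shards` text.

    Assumes the columns were requested as h=index,shard,prirep,node.
    """
    primaries = Counter()
    totals = Counter()
    for line in cat_shards_text.strip().splitlines():
        parts = line.split()
        if len(parts) < 4:
            continue  # unassigned shards have an empty node column
        _index, _shard, prirep, node = parts[:4]
        totals[node] += 1
        if prirep == "p":  # "p" marks a primary, "r" a replica
            primaries[node] += 1
    return primaries, totals


# Hypothetical sample: 3 shards with 1 replica each, spread over 3 nodes.
sample = """
logs 0 p node-1
logs 0 r node-2
logs 1 p node-1
logs 1 r node-3
logs 2 p node-2
logs 2 r node-3
"""

primaries, totals = shards_per_node(sample)
print(dict(primaries))  # {'node-1': 2, 'node-2': 1}
print(dict(totals))     # {'node-1': 2, 'node-2': 2, 'node-3': 2}
```

In this made-up sample the primaries are uneven (node-3 holds none), yet every node carries two shards in total, which is the kind of layout described above as balanced.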