I'm wondering if there's a known formula for choosing when to have many indexes with, say, one shard each vs. one index with many shards, or some point in between, for the case where it's trivial to determine which node/shard on which to index/query data (i.e., it's chunkable).
Specifically, my data is routed via a (spatial + data type) hash, which is easy to divide across indexes and/or shards.
Is there any technical advantage/disadvantage to, say, 1-index-64-shards vs. 64-indexes-1-shard? AFAIK, increasing the node count would auto redistribute the shards either way.
This would include scaling, memory, and performance considerations.
Not in my experience. And down at the Lucene level the two cases are pretty much the same: An Elastic shard is simply a Lucene index. So whether you have 46 shards in one big Elastic index or 64 Elastic indices with 1 shard each you still end up with 64 Lucene indices. There could be some differences in cluster state, but I'm not sure what.
For me, when designing cluster indices, I first focus on the 20-40 GB shard size rule. Let's say I need daily indices to store around 30 GB data, then I put 1 shard in its index template. While if I need weekly indices of 200 GB data I'll use between 5 and 7 primary shards in the template.
Another thing to consider is the maximum number of shards per node, which can be computed from the 20 shards per GB of heap space rule. If I assign 8 GB of Java Heap Space to each data node I need to make sure that each node has fewer than 8 * 20 = 160 shards (primary + replica). If I get close to this number I simply add more data nodes to spread the shards more thinly (or I could increase the heap space, but that would add more work load to each data node so I prefer the first solution).
Apache, Apache Lucene, Apache Hadoop, Hadoop, HDFS and the yellow elephant
logo are trademarks of the
Apache Software Foundation
in the United States and/or other countries.