Let's say I have a 100-node cluster, each node with a 16G heap (50% of the RAM). I
have a single index, it's 1G, and I know it won't grow much (it will never
grow above the heap size).
1. Is having a single shard better than having the default of 5? Then, to spread
the load, I would run 99 replicas.
2. If index size < heap size, is all that heap memory wasted? I mean, let's
say I have a 5G index, would it be better to have 4 nodes with a 5G heap or 2
nodes with a 10G heap?
1GB is small, it's very easy to make it fit entirely in the filesystem
cache. For this kind of small index where you want to optimize the search
throughput, just have one shard per index and have it replicated once per
node (option 1 that you described).
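To make that concrete, here is a minimal sketch of the index settings for option 1; the index name my-index is just a placeholder, and the replica count of 99 assumes your 100-node cluster:

    PUT /my-index
    {
      "settings": {
        "index.number_of_shards": 1,
        "index.number_of_replicas": 99
      }
    }

Alternatively, setting index.auto_expand_replicas to "0-all" keeps one copy of the shard on every node without hardcoding the replica count, so it stays correct if nodes join or leave the cluster.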
Regarding question 2: indeed, with such a small index you will probably not need 16GB
per machine. Running several nodes per machine would not help either, since the
bottleneck would be CPU (not disk, since everything fits in the FS cache, and
not memory, since you have much more memory than your index size), and a
single node can already make use of all your CPUs.
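If you do decide to shrink the heap, that is just a JVM option change; for example (the 4g value is only an illustration, assuming a recent Elasticsearch version that reads config/jvm.options.d/):

    # config/jvm.options.d/heap.options
    -Xms4g
    -Xmx4g

Keep -Xms and -Xmx equal so the heap is not resized at runtime, and leave the rest of the machine's RAM to the filesystem cache.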
For more general considerations about shard sizing, the following chapter
of the reference guide gives practical advice on choosing the right
shard size and number of shards: