I have a 5 node cluster running elasticsearch 7.0.0.
Indices: 700
Total size: 350GB (Primary storage excluding replicas)
After watching the videos, you will probably understand that in general you can have around 50GB per shard (it depends, as usual) and at most 20 shards per GB of heap.
Here you have 1400 shards on 5 nodes, so 280 shards per node, which means that you need at least 14GB of heap per node.
But at 50GB per shard, most likely only 7 primaries are needed, so 14 shards including replicas. That is around 3 shards per node.
That would probably require less memory. I'd say that 8GB of heap could be enough.
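The arithmetic above can be sketched quickly. This is a rough sizing sketch only; the 50GB-per-shard and 20-shards-per-GB-of-heap figures are rules of thumb, not hard limits:

```python
# Rough shard/heap sizing arithmetic based on the rules of thumb above.
# These are guidelines, not hard limits -- always benchmark your own workload.

total_shards = 1400                          # current shard count across the cluster
nodes = 5
shards_per_node = total_shards // nodes      # 280 shards per node
min_heap_gb = shards_per_node / 20           # 20 shards per GB of heap -> 14.0 GB

primary_storage_gb = 350
target_shard_size_gb = 50
primaries = primary_storage_gb // target_shard_size_gb  # 7 primaries
total_with_replicas = primaries * 2                     # 14 shards with 1 replica
per_node = total_with_replicas / nodes                  # ~3 shards per node

print(shards_per_node, min_heap_gb, primaries, total_with_replicas, per_node)
```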
Again, it depends, so you need to test that against your own scenarios. Look at the resources I linked to.
I would like to add a few clarifications. As described in this webinar, the amount of heap used will depend on how you have indexed and mapped your data, as well as how large and optimized your shards are. Larger shards often use less heap per document than smaller ones, which is why using large shards is generally recommended for efficiency.
The rule of thumb of 20 shards per GB of heap comes from users often having far too many small shards; once you reach this limit, the system is generally still working well. This recommendation is, however, a maximum number of shards and not a level you should necessarily expect to reach. If you are using large shards, I would expect the number of shards per node to be lower than the prescribed limit.
I sometimes hear the recommendation interpreted as "I should be able to have 20 50GB shards per GB of heap", which is not correct. If this were the case, a node could hold 1TB of data per GB of heap, which is generally very hard to achieve, at least without using frozen indices or extensive optimizations.
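To make the misreading concrete, here is the arithmetic spelled out (illustrative numbers only):

```python
# Why "20 shards per GB of heap" does NOT imply "20 x 50GB of data per GB of heap":
shards_per_gb_heap = 20
shard_size_gb = 50
data_per_gb_heap = shards_per_gb_heap * shard_size_gb  # 1000 GB = 1 TB per GB of heap

# Scaled up, a node with 30GB of heap would then hold 30TB of data, which is
# generally unrealistic without frozen indices or heavy optimization.
print(data_per_gb_heap)
```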
Each query is executed single-threaded against each shard, although queries against multiple shards are run in parallel. The optimal number of shards for query performance therefore depends on the number of concurrent queries competing for resources, as well as whether CPU, heap, or disk I/O is limiting performance. Having multiple shards can often be faster than having a single one, but if you have too many shards, performance is likely to start deteriorating. The best way to find out is to benchmark with data and queries that are as realistic as possible.
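When benchmarking, one simple approach is to run the same realistic queries repeatedly against each shard layout and compare latency percentiles, for example using the `took` field (milliseconds) that Elasticsearch returns with every search response. A minimal summary sketch; the sample values below are made up:

```python
import statistics

def summarize_latencies(took_ms):
    """Summarize per-query latencies (e.g. the 'took' field, in milliseconds,
    collected from repeated _search responses against one shard layout)."""
    s = sorted(took_ms)
    return {
        "p50": statistics.median(s),
        "p95": s[max(0, int(len(s) * 0.95) - 1)],  # simple nearest-rank p95
        "max": s[-1],
    }

# Fabricated example: 'took' values collected while benchmarking one layout.
layout_a_took = [12, 15, 14, 40, 13, 16, 15, 90, 14, 13]
print(summarize_latencies(layout_a_took))
```

Comparing percentiles rather than averages matters here, because a few slow shard-level executions can dominate tail latency without moving the mean much.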
@dadoonet @Christian_Dahlqvist I tried to implement some of the suggestions for performance, but I haven't noticed a considerable performance improvement. Can you please tell me if I am doing something wrong? The new configurations, compared to the old ones described above, are:
That still sounds like far too many shards given the size of the data. If you had a total of 10 indices with a single primary shard each, the average shard size would be 14GB, which is quite reasonable.
I reduced the number of primary shards to 70 (I could still reduce it, but I have time-based indices and would like some granularity). But I'm not seeing any performance improvement while querying; instead, the query time increased a bit.
Then I increased replicas from 1 to 2. The query time improved a bit, but increasing replicas is not the preferred choice for me, as the store size increases.
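The storage cost of replicas is easy to quantify: the total store size is the primary size times (1 + number of replicas). A quick check with the 350GB primary figure from this thread:

```python
# Total store size grows linearly with the replica count.
def total_store_gb(primary_gb, replicas):
    return primary_gb * (1 + replicas)

primary_gb = 350
print(total_store_gb(primary_gb, 1))  # 700 GB with 1 replica
print(total_store_gb(primary_gb, 2))  # 1050 GB with 2 replicas
```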
I have 8 CPU cores on each node, which I think is enough.