I am creating an elastic cluster with 2.5TB dataset.
Questions:
What is the best way to design shards for best performance?
The only hard limit is ~2.1 billion documents per shard (it's a Lucene core limitation); 20-40GB is a soft limit. Is that still the case in newer versions of Elasticsearch?
What would be the optimal configuration of hardware for this dataset? Is there any benchmarking provided by elastic search?
The optimal shard size depends a lot on your use case and latency requirements. Each query fans out across the involved shards in parallel, but each shard is processed in a single thread, which means minimum query latency tends to increase with shard size. Exactly how much depends on your data and hardware as well as how you query it. This is described in this old Elastic{ON} talk. Even though it is quicker to query a small shard than a big one, you generally want shards to be as large as possible, as splitting data into a lot of very small shards can give very bad performance because there will be a lot of tasks to perform.
The recommendation is therefore to benchmark based on realistic data and query types and volumes to get the right settings for your use case and hardware. Indexing also competes for the same resources, so make sure you always benchmark using a realistic combined load.
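To make the sizing advice above concrete, here is a back-of-envelope sketch for the 2.5TB dataset from the question. All numbers besides the Lucene document limit are illustrative assumptions, not Elastic recommendations; the 20-40GB range is the soft limit discussed above.

```python
import math

# Illustrative back-of-envelope only; replace with your own benchmark results.
DATASET_GB = 2500                    # ~2.5 TB of primary data (from the question)
TARGET_SHARD_GB = 40                 # upper end of the commonly cited 20-40 GB soft limit
MAX_DOCS_PER_SHARD = 2_147_483_519   # hard Lucene per-shard limit (Integer.MAX_VALUE - 128)

def primary_shards(dataset_gb, target_shard_gb=TARGET_SHARD_GB):
    """Smallest primary shard count keeping each shard at or under the target size."""
    return math.ceil(dataset_gb / target_shard_gb)

print(primary_shards(DATASET_GB))  # 63 shards of ~40 GB each
```

At the lower end of the soft limit (20GB per shard) the same dataset would need 125 primaries, which is why benchmarking with your real queries matters before committing to a shard count.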
Thanks for your valuable comment. Can I find Elastic hardware benchmarks somewhere to get insight into how to choose hardware for larger datasets?
@Christian_Dahlqvist, A good rule of thumb is to keep the number of shards per node below 20 per GB of configured heap. A node with a 30GB heap should therefore have a maximum of 600 shards. This will generally help the cluster stay in good health.
What this means is: if I spin up a 16C/64GB machine as one node and give 30GB to the heap, I can put a maximum of 600 shards on it.
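The arithmetic behind that rule of thumb is simply heap size times the per-GB shard allowance:

```python
def max_shards_per_node(heap_gb, shards_per_gb=20):
    # Rule of thumb quoted above: at most 20 shards per GB of configured heap.
    return heap_gb * shards_per_gb

print(max_shards_per_node(30))  # 600
```

Note this is an upper bound for cluster health, not a target; the talk linked earlier argues for fewer, larger shards where latency allows.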
The blog post you linked to assumes a logging use case with limited querying, so may not apply as your use case is quite different.
Supporting 1000 concurrent queries that all hit the full dataset will put your cluster under serious load. I would recommend setting up 3 dedicated master nodes, which can be smaller than the data nodes. With respect to data nodes I would expect you to need a considerably larger cluster than you outlined as you either will need to increase the number of replicas used to handle the query throughput or hold quite limited amounts of data per node to ensure you benefit from caching.
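As a rough illustration of why 1000 concurrent full-dataset queries is such a serious load: each query becomes one single-threaded task per shard it touches. The following is a pessimistic ceiling that assumes every query is executing on every shard simultaneously; every number in it is an assumption to replace with benchmark measurements, and real clusters come out far smaller because queries queue, caches absorb work, and replicas spread load.

```python
import math

# Pessimistic worst-case sketch; all inputs are assumptions.
concurrent_queries = 1000
shards_per_query = 63          # hypothetical primary count for 2.5 TB at ~40 GB/shard
search_threads_per_node = 16   # roughly one search thread per vCPU (assumed)

# One single-threaded shard task per (query, shard) pair, all in flight at once:
shard_tasks = concurrent_queries * shards_per_query
nodes_ceiling = math.ceil(shard_tasks / search_threads_per_node)
print(nodes_ceiling)  # 3938 nodes in the no-queueing worst case
```

The absurd worst-case number is the point: throughput at that concurrency has to come from replicas, caching, and queueing, which is exactly why benchmarking the combined load matters.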
I recommend you look at the links I provided. I do not think anyone can give you the answers you are looking for, so you need to test and benchmark your use case.
@Christian_Dahlqvist, thanks for your help, really appreciate it. Could you please help me with another query?
I have an index (200GB) with 1 shard on an EC2 instance with 8 vCores and 32GB RAM. The search query time with this configuration is 5 seconds. Will that improve if I upgrade the instance to 16 vCores and 64GB RAM with the same shard configuration?
Each shard included in a query is processed in a single thread, so if you have only a single primary shard you will not be able to use the threads you already have available. Increasing CPU should therefore not improve the performance of an individual query, although you may be able to handle more concurrent queries.
Adding RAM might allow the OS to cache more efficiently which could improve performance.
As your data set is larger than the OS page cache, it is likely you are limited by disk I/O, so that is the first area I would check.
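A quick way to see why disk I/O is the first suspect: estimate what fraction of the index fits in the page cache. The heap size here is an assumption (the common "about half of RAM" sizing); the rest of RAM is treated as available to the page cache, which ignores the OS and other processes.

```python
# Rough page-cache coverage estimate; heap size is an assumption.
index_gb = 200
ram_gb = 32
heap_gb = 16                      # assumed ~half-of-RAM heap sizing
page_cache_gb = ram_gb - heap_gb  # optimistic: ignores OS and other processes

coverage = page_cache_gb / index_gb
print(f"~{coverage:.0%} of the index can be cached")  # ~8%
```

With only ~8% of the index cacheable, most query reads hit disk, so storage speed dominates latency regardless of CPU.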
I would recommend splitting the index into a larger number of primary shards and ensure you have the fastest storage possible.
@Christian_Dahlqvist, Thanks for your super fast reply! Could you please tell me what is the impact of the number of documents in a shard vs storage size of document in a shard?
I do not understand the question. I would expect the comparison to be between an index with 1 primary shard, 20M documents, and 200GB in size, and another with 4 primary shards, 5M documents per shard, and roughly the same total size.
The result of the comparison will depend on what is limiting performance. If it is CPU I would expect a significant improvement but if it is something else the difference may be small.
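To put an upper bound on that "significant improvement": if the 5-second query from earlier in the thread were purely CPU-bound, splitting its single shard into 4 would at best divide the latency by 4, since the shards are searched in parallel. This is an idealized illustration; coordination overhead and non-CPU bottlenecks (such as the disk I/O discussed above) make real gains smaller.

```python
# Ideal-case illustration only: assumes perfectly parallel, CPU-bound shard work.
single_shard_latency_s = 5.0  # measured latency from the question
shards = 4

ideal_latency_s = single_shard_latency_s / shards  # no overhead assumed
print(ideal_latency_s)  # 1.25
```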