Max shard size for a very large single index

I am planning to build an Elasticsearch cluster to host a very large single index. The main use case is to index documents and search them by keyword.

We estimate the size of the index to be 16TB, so if I keep the shard size to 50GB (I read that 50GB is ideal), I am looking at 320 primary shards. Initially it will be 9TB and is expected to grow to 16TB over the next 5 years. With at least 1 replica, I am looking at 640 shards. How many nodes do I need to support that many shards?
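For reference, the arithmetic behind those numbers looks like this (a minimal sketch in Python; the 50GB target, decimal TB, and single replica are just the assumptions from the post above):

```python
# Back-of-envelope shard count for the figures above (illustration only).
index_size_gb = 16 * 1000    # ~16 TB projected after 5 years
target_shard_gb = 50         # commonly cited starting point per shard
replicas = 1                 # one replica copy of each primary

primaries = -(-index_size_gb // target_shard_gb)   # ceiling division -> 320
total_shards = primaries * (1 + replicas)          # primaries + replicas -> 640

print(f"primary shards: {primaries}, total shards: {total_shards}")
```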

I am no expert, but it would depend on how much heap each node has.

From the research I have done, the guideline works out to ~20 shards per GB of heap, but it varies with other factors.

I.e. if a node has an 8GB heap, the ideal sizing would be 160 shards. I generally add a failover node as well, so (rough arithmetic sketched after the list):

4+1 nodes @ 8GB heap
2+1 nodes @ 16GB heap
etc.
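A minimal sketch of how that works out, using the numbers from this thread (the 8GB heap and the 640-shard total are assumptions carried over from the posts above):

```python
# Rough node-count estimate from the "~20 shards per GB of heap" rule of thumb.
# (As noted further down, this is a maximum guideline, not an ideal target.)
total_shards = 640          # primaries + replicas from the earlier estimate
shards_per_gb_heap = 20     # rule-of-thumb ceiling
heap_gb_per_node = 8        # assumed heap size per data node

shards_per_node = shards_per_gb_heap * heap_gb_per_node   # 160
data_nodes = -(-total_shards // shards_per_node)           # ceiling division -> 4

print(f"data nodes needed (plus any failover node): {data_nodes}")
```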

As far as search performance goes, that is another story.

But to get resilience to failures, we need to have enough nodes for the replicas, right?

The guideline of 20 shards per GB of heap is a maximum, not an ideal. If you have quite large shards, I would expect each node to hold considerably fewer shards than that.

Each query is executed in a single thread against each shard, but multiple shards can naturally be processed in parallel. This means that the minimum latency will depend on the size of the shard as well as on the data and mappings used. 50GB is often used as a reasonable starting point, but the ideal shard size may be different for your use case (smaller or larger).

The number of nodes you need to hold your data will depend on your query latency and throughput requirements as well as on heap usage. If your data set is larger than what can be cached in RAM, the performance of your disks will be important and will affect latency. The more data a node holds, the more I/O is typically required to serve a query.

For this type of use case I would recommend running some tests and benchmarks to determine how large your shards should be and how much data each node can hold.
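As a starting point for that kind of test, you can check how big your shards actually are with the _cat/shards API. A small sketch using the Python requests library; the host and index name are placeholders for your own cluster:

```python
# Sketch: list shard sizes for an index so you can compare against your target.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/shards/my-index",
    params={"v": "true", "bytes": "gb", "h": "index,shard,prirep,store,node"},
)
resp.raise_for_status()
print(resp.text)   # one line per shard copy: index, shard number, p/r, size in GB, node
```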

Thank you for the response. Since my application deals with searching on large datasets, I would not want to run more than 3 shards per node. I do have another question: in a cluster that involves a very large number of nodes, is it generally better to have dedicated master nodes, data nodes and coordinating nodes (I may not have ingest nodes, so I am ignoring those)?

Is there official documentation on how to decide how to split the nodes?
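For what it's worth, in recent Elasticsearch versions dedicated roles are assigned per node via the node.roles setting in elasticsearch.yml (an empty roles list makes a coordinating-only node), and you can see which roles each node currently has with the _cat/nodes API. A hedged sketch, again with a placeholder host:

```python
# Sketch: list each node's roles and heap so you can see how a cluster is split.
import requests

resp = requests.get(
    "http://localhost:9200/_cat/nodes",
    params={"v": "true", "h": "name,node.role,heap.max"},
)
resp.raise_for_status()
print(resp.text)   # node.role shows one letter per role, e.g. "m" for master, "d" for data
```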
