How many shards should I create for an index across 20 data nodes?

Recently, we built an ES cluster with 20 data nodes, each with 32 GB of memory and 32 CPU cores.
Does anyone have suggestions on how many shards I should create for an index
across the 20 data nodes?

Please let me know if you need more detailed specs.

Thanks in advance

Why did you build a cluster of that size?

Normally you do the opposite. Check how many docs a single shard can have, how many shards a single node can have, then find the number of nodes you need.
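
As a rough illustration of that direction, here is a minimal Python sketch with purely hypothetical numbers - the per-shard and per-node limits are assumptions you would replace with figures benchmarked on your own data:

```python
import math

# All of these numbers are hypothetical placeholders - benchmark your own data.
total_docs = 2_000_000_000      # expected total document count
docs_per_shard = 100_000_000    # how many docs one shard handles comfortably
shards_per_node = 20            # how many shards one node handles comfortably

primary_shards = math.ceil(total_docs / docs_per_shard)   # -> 20 primaries
data_nodes = math.ceil(primary_shards / shards_per_node)  # -> 1 node (before replicas and headroom)
print(primary_shards, data_nodes)
```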

Also, different strategies may apply, depending on whether you are dealing with time-based data or not.

Without any other information, I'd say that with 20 nodes, 10 primary shards and 1 replica would be a reasonable starting point. But again, that's probably not going to fit your needs exactly.
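
For what it's worth, a minimal sketch of creating an index with that layout, assuming the official `elasticsearch` Python client (7.x-style `body=` calls) and a hypothetical index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# 10 primaries + 1 replica of each = 20 shard copies, i.e. roughly one per data node.
es.indices.create(
    index="metrics-2018.01",                 # hypothetical index name
    body={
        "settings": {
            "index.number_of_shards": 10,
            "index.number_of_replicas": 1,
        }
    },
)
```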

Hi Robin,

The average size of one Elasticsearch shard (primary or replica) should not be more than about 80 GB. Keeping shards within that limit helps with quick initialisation and allocation of shards among nodes when you restart a node or in case of a failure.

The number of shards depends on how much data you have right now and what the data growth rate will be over the next 2 years.

E.g.: if your data volume is 1.2 TB and data growth is 20 percent a year, that makes your total planned volume roughly 1.7 TB after 2 years. Hence you should go with around 20 primary shards and at least 1 replica.
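
The arithmetic behind that example, as a quick Python sketch (the 2-year horizon and the ~80 GB shard target are the assumptions from above):

```python
current_tb = 1.2
growth_rate = 0.20
years = 2

planned_tb = current_tb * (1 + growth_rate) ** years          # ~1.73 TB
target_shard_gb = 80
primary_shards = round(planned_tb * 1024 / target_shard_gb)   # ~22, so ~20 primaries is in the right range
print(planned_tb, primary_shards)
```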

If you have ample resources and memory, you can afford 2 replicas. A good estimate of the number of primary shards at index creation time is vital, as you cannot change this number once the index is created.
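
Unlike the primary count, the replica count is a dynamic index setting, so it can be raised later if resources allow; a minimal sketch with the Python client (hypothetical index name):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Raise the replica count on an existing index; the primary count cannot be changed this way.
es.indices.put_settings(
    index="metrics-2018.01",                      # hypothetical index name
    body={"index": {"number_of_replicas": 2}},
)
```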

Another suggestion: always use an index alias in your queries, so that you can change the underlying index name whenever you need to. It gives you the flexibility to re-index data with ease.
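
A minimal sketch of that alias pattern with the Python client - the alias and index names here are hypothetical - so clients keep querying `metrics` while you swap the backing index after a re-index:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Atomically point the "metrics" alias at the newly re-indexed index.
es.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "metrics-v1", "alias": "metrics"}},  # hypothetical names
        {"add":    {"index": "metrics-v2", "alias": "metrics"}},
    ]
})
```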

You can have a good read at the link below. Hope it helps.

What is your use case? How much data do you envision having?

Hi @dadoonet, @Christian_Dahlqvist, @Kumar_Narottam,

The situation is this: we have 20 ES data nodes in total, each with 32 GB of memory, 32 CPU cores and 4.6 TB of storage.
We want to keep a year of metrics and 6 months of logs live across the whole ES data cluster.

We probably need to index around 200 GB of metrics data and 100~150 GB of log data per day, more or less,
and over time the volume will likely grow by about 20% a year.
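
As a rough back-of-envelope for what that retention adds up to, a quick Python sketch (125 GB/day is just the midpoint of the 100~150 GB range above):

```python
# Primary data retained at any one time, before replicas and compression.
metrics_gb_day = 200
logs_gb_day = 125                                          # midpoint of the 100~150 GB range
retained_gb = metrics_gb_day * 365 + logs_gb_day * 180     # ~95,500 GB of primary data
retained_after_growth_gb = retained_gb * 1.20              # with the expected ~20% yearly growth
print(retained_gb, retained_after_growth_gb)
```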

So we really want to get a good shard allocation across the whole set of ES data nodes for performance,
whether for reads, writes or queries against the data nodes.

Our architecture looks like this:

Beats -> HAProxy cluster (2) -> Logstash cluster (3) -> ES cluster {master (3), client (3), data (20)} -> dashboards {Grafana, Kibana}

For Beats, we're using Metricbeat to collect the metrics and Filebeat to collect the logs.

HAProxy is used for shipping the metrics/logs to the Logstash cluster.

Logstash filters and modifies the original data and then sends it to the ES data nodes.

The client nodes are used for querying data from the dashboards, i.e. Grafana and Kibana.

Grafana is used to visualize the metrics.

Kibana is used to visualize the logs.

I'm looking forward to your replies. :slight_smile:

Store data with different retention periods in different indices. In order to be able to hold as much data as possible on your nodes, aim to have an average shard size between a few GB and a few tens of GB. As the amount of data indexed per day is quite low for a cluster that size, I would be surprised if you had any issues even if the write load was not perfectly distributed, so you do not necessarily need to create 10 primary shards for every index in order to make sure a shard ends up on every node for every index.
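
One possible shape of that, sketched with the Python client and the (6.x-style) legacy template API - the pattern names, shard counts and replica counts here are hypothetical, just illustrating separate settings for the metrics and log indices:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# Daily metrics indices: fewer primaries than nodes is fine at this ingest rate.
es.indices.put_template(
    name="metrics-template",                  # hypothetical template name
    body={
        "index_patterns": ["metricbeat-*"],
        "settings": {"index.number_of_shards": 4, "index.number_of_replicas": 1},
    },
)

# Log indices with a shorter retention get their own pattern and settings.
es.indices.put_template(
    name="logs-template",                     # hypothetical template name
    body={
        "index_patterns": ["filebeat-*"],
        "settings": {"index.number_of_shards": 3, "index.number_of_replicas": 1},
    },
)
```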

If you still want to spread the indexing load and therefore risk ending up with a large number of small shards, the shrink index API can help you reduce the number of shards once you have completed indexing. You could e.g. index into 20 primary shards and then shrink the index down to 5 primary shards, which will increase the average shard size for the longer term storage of the index.
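
A sketch of that shrink flow with the Python client (index and node names are hypothetical; the source index has to be made read-only and have a copy of every shard on one node first, and the target primary count must be a factor of the source count, which 5 is for 20):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

source, target = "metricbeat-2018.01.15", "metricbeat-2018.01.15-shrunk"  # hypothetical names

# 1. Prepare the source: relocate a copy of every shard to one node and block writes.
es.indices.put_settings(
    index=source,
    body={
        "index.routing.allocation.require._name": "data-node-01",  # hypothetical node name
        "index.blocks.write": True,
    },
)

# 2. Shrink 20 primaries down to 5 once relocation has finished.
es.indices.shrink(
    index=source,
    target=target,
    body={"settings": {"index.number_of_shards": 5, "index.number_of_replicas": 1}},
)
```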

Another way to ensure you do not get too many small indices is to let each index cover a variable period of time and instead optimise it for a specific shard size in terms of documents. This is something the rollover API can help you with. You can determine how many documents give a good index/shard size for a specific index type and then target this document count per shard using this API. This means that a particular index may cover a number of days while another index with less data flowing in could cover a couple of weeks.
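
A sketch of that rollover pattern with the Python client - the write alias name and the document threshold are hypothetical, chosen to illustrate targeting a per-shard document count rather than a fixed time period:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed endpoint

# The write alias must already point at the current index, e.g. "filebeat-000001".
# Call this periodically (e.g. from cron); ES only rolls over when a condition is met.
es.indices.rollover(
    alias="filebeat-write",                 # hypothetical write alias
    body={
        "conditions": {
            "max_docs": 500_000_000         # hypothetical doc count that gives a good shard size
        }
    },
)
```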

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.