Elasticsearch Shards/Indices planning

skat80 · November 21, 2018, 2:40pm

Hello,

While planning how to allocate shards in a 3-node Elasticsearch cluster, I became totally overwhelmed.
Could somebody please check the following points and shed some light?

What are indices? I've read that documents inside shards are organized in indices.
By default Graylog defines 20 indices. What does that mean, and how do they affect the amount of data that will be stored on an Elasticsearch node?
In a 3-node Elasticsearch setup, is 3 shards + 1 replica a correct allocation?
Also, the need for a replica is unclear to me. Why do I need a replica if my data will be indexed on a SAN drive that consists of multiple redundant SSD disks?
Is there a specific formula for calculating the combination of shards/indices/retention strategy?

Any help would be appreciated,
thanks!

mjunaidmuzammil · November 21, 2018, 5:06pm

Hi,

I will try responding inline to your questions.

What are indices? I've read that documents inside shards are organized in indices.

An ES index is a namespace for a collection of documents. If you are coming from the relational database world, it is roughly analogous to a database. An index supports partitioning, since it can contain one or more shards, and replication, with a configurable number of replicas. The benefit of partitioning (sharding) is that the data can be split into multiple parts and distributed across multiple nodes.

By default Graylog defines 20 indices. What does that mean, and how do they affect the amount of data that will be stored on an Elasticsearch node?

It simply means that there are 20 separate logical namespaces for holding data in ES. The maximum amount of data that can be stored in ES does not depend on the number of indices as such; it is linked to the number of shards. Since one index can have multiple shards, the amount of data an index can hold is proportional to the number of shards it comprises.

In a 3-node Elasticsearch setup, is 3 shards + 1 replica a correct allocation?
Also, the need for a replica is unclear to me. Why do I need a replica if my data will be indexed on a SAN drive that consists of multiple redundant SSD disks?

You can run the 3-node ES setup with 2 nodes holding the Master + Data roles and one node holding the Master role only. All 3 shards will be hosted on a single data node, but this leaves you the option of scaling later by provisioning new data nodes and relocating some shards. From what I have read, network-based storage is generally not recommended for ES for performance reasons: network storage is slower than locally attached disks. As for replicas, redundant disks protect against disk failure, not node failure. You need replicas so that, if a node failure takes some primary shards down, the replicas can still respond to requests.

Is there a specific formula for calculating the combination of shards/indices/retention strategy?

There isn't any simple formula; it depends on the use case. Some simple considerations:
i. Optimal shard size is 10-40 GB. Try to keep each shard below the 40 GB limit.
ii. Aim for no more than about 25 shards per 1 GB of JVM heap on your ES data nodes.
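
To make the two rules of thumb above concrete, here is a rough Python sketch, not an official sizing tool. It plugs in the numbers from this thread (20 indices, 3 primary shards, 1 replica, 2 data nodes) and assumes the 25-shards-per-GB figure refers to JVM heap. The function names are made up for illustration.

```python
import math


def total_shards(indices, primaries_per_index, replicas):
    """Every primary shard has `replicas` extra copies, and each copy is a shard too."""
    return indices * primaries_per_index * (1 + replicas)


def heap_gb_for(shard_count, shards_per_gb_heap=25):
    """Heap implied by the '25 shards per GB' rule of thumb (assumed to mean JVM heap)."""
    return shard_count / shards_per_gb_heap


# Numbers taken from this thread; adjust them to your own setup.
shards = total_shards(indices=20, primaries_per_index=3, replicas=1)
data_nodes = 2
shards_per_node = math.ceil(shards / data_nodes)

print(shards)                        # 120 shards in the whole cluster
print(shards_per_node)               # 60 shards on each data node
print(heap_gb_for(shards_per_node))  # 2.4 GB of heap per data node as a lower bound
```

The heap figure is only a floor implied by the shard count; the actual data volume and query load matter as well.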

Hope this helps you out.

skat80 · November 21, 2018, 6:54pm

Hello, thanks for your detailed answer.
Just a few points to follow up on:

Is it possible for a node to have the Master and Data roles simultaneously?
According to your plan, a single data node will host all 3 shards and the remaining 2 nodes will have no shards. Am I correct?
How do you define the shard size?

Thanks in advance!

mjunaidmuzammil · November 22, 2018, 5:29am

Is it possible for a node to have the Master and Data roles simultaneously?

Yes, you can set both node.data and node.master to true for that.
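
For the layout discussed here, the role flags could look roughly like this. It is only a sketch: the node names are hypothetical, and it assumes the boolean node.master / node.data settings in elasticsearch.yml mentioned above. The helper functions exist only for this example.

```python
def role_settings(master_eligible, holds_data):
    """Return the two role flags for one node as a plain dict."""
    return {"node.master": master_eligible, "node.data": holds_data}


def to_yaml_lines(settings):
    """Render the flags the way they would appear in elasticsearch.yml."""
    return [f"{key}: {str(value).lower()}" for key, value in settings.items()]


# Hypothetical node names; two Master + Data nodes and one Master-only node.
cluster = {
    "es-node-1": role_settings(master_eligible=True, holds_data=True),
    "es-node-2": role_settings(master_eligible=True, holds_data=True),
    "es-node-3": role_settings(master_eligible=True, holds_data=False),
}

for name, settings in cluster.items():
    print(f"# {name}")
    for line in to_yaml_lines(settings):
        print(line)
```

With three master-eligible nodes, the cluster can still elect a master if one node goes down.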

According to your plan, a single data node will host all 3 shards and the remaining 2 nodes will have no shards. Am I correct?

Two data nodes will be hosting shards. There will be a total of 3 + 3 = 6 shards, since you are setting number_of_replicas to 1. There are 2 Data + Master-eligible nodes; these will host the shards (3 on each node), whereas the 3rd one (Master only) will not have any shard allocations.
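
The count above can be checked with a small sketch. The round-robin placement below is a simplification for illustration, not Elasticsearch's actual allocation algorithm. The only rule it borrows is that a replica is never placed on the same node as its primary. Node names are hypothetical.

```python
def place_shards(data_nodes, primaries, replicas):
    """Spread primaries (p0, p1, ...) and replicas (r0, r1, ...) over the data nodes."""
    if replicas >= len(data_nodes):
        raise ValueError("need more data nodes than replicas to separate the copies")
    layout = {node: [] for node in data_nodes}
    for shard in range(primaries):
        for copy in range(1 + replicas):
            # Consecutive copies of the same shard go to consecutive nodes,
            # so a primary and its replica never share a node.
            node = data_nodes[(shard + copy) % len(data_nodes)]
            label = f"p{shard}" if copy == 0 else f"r{shard}"
            layout[node].append(label)
    return layout


layout = place_shards(["es-node-1", "es-node-2"], primaries=3, replicas=1)
for node, shards in layout.items():
    print(node, shards)
```

Each data node ends up with 3 shards and holds one copy of every shard, which is why losing one of the two data nodes does not lose data.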

How do you define the shard size?

By shard size, are you referring to the number of shards in an index? We define the number of shards in an index when creating the index, or by using index templates. You need to estimate the size when creating it: how much data do you expect it to hold? If you think an index will grow to around 100 GB, it would be better to allocate 3 or 4 shards. It is better to over-allocate an extra shard initially when creating the index; if you want to reduce the count, that can be done later.
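
As a rough illustration of that estimate, the sketch below derives a primary shard count from an expected index size and builds the settings body that an index-creation request or index template would carry. The 40 GB ceiling comes from the rule of thumb earlier in the thread; the function names are assumptions made for this example.

```python
import json
import math


def primary_shards_for(expected_index_gb, max_shard_gb=40.0):
    """Smallest number of primaries that keeps every shard at or below max_shard_gb."""
    return max(1, math.ceil(expected_index_gb / max_shard_gb))


def index_settings(primaries, replicas):
    """Settings body using the number_of_shards / number_of_replicas index settings."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,
                "number_of_replicas": replicas,
            }
        }
    }


primaries = primary_shards_for(100)  # 100 GB / 40 GB per shard, rounded up, gives 3
body = index_settings(primaries, replicas=1)
print(json.dumps(body, indent=2))
```

Passing a lower max_shard_gb (for example 25) yields 4 primaries for the same 100 GB, which matches the "3 or 4 shards" range suggested above.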

system · December 20, 2018, 5:29am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
