Elasticsearch Shards/Indices planning


(George) #1

Hello,

While planning how to allocate shards in a 3-node Elasticsearch cluster, I got totally overwhelmed.
Could anybody please check the following points and shed some light?

  1. What are indices? I've read that documents inside shards are organized in indices.
    By default Graylog defines 20 indices. What does that mean, and how do they affect the amount of data that will be stored on an Elasticsearch node?

  2. In a 3-node Elasticsearch setup, is 3 shards + 1 replica a correct allocation?
    Also, the need for a replica is unclear to me. Why do I need a replica if my data will be indexed on a SAN drive that consists of multiple redundant SSD disks?

  3. Is there a specific formula for calculating the combination of shards/indices/retention strategy?

Any help would be appreciated,
thanks!


(Junaid) #2

Hi,

I will try responding inline to your questions.

  1. What are indices? I've read that documents inside shards are organized in indices.

An ES index is a namespace for a collection of documents. If you are coming from the relational DB world, it is analogous to a database. An index supports partitioning, since it can contain one or more shards, and replication, with a configurable number of replicas. The benefit of partitioning (sharding) is that the data can be divided into multiple files and distributed across multiple nodes.

By default Graylog defines 20 indices. What does that mean, and how do they affect the amount of data that will be stored on an Elasticsearch node?

It simply means that there are 20 separate logical namespaces for holding data in ES. The maximum amount of data that can be stored in ES does not depend on the number of indices; it is tied to the number of shards. Since one index can have multiple shards, the amount of data an index can hold is proportional to the number of shards it comprises.

  2. In a 3-node Elasticsearch setup, is 3 shards + 1 replica a correct allocation?
    Also, the need for a replica is unclear to me. Why do I need a replica if my data will be indexed on a SAN drive that consists of multiple redundant SSD disks?

You can use the 3-node ES scheme with 2 nodes having the Master + Data roles and one node with the Master role only. All 3 shards will be hosted on a single data node, but this gives you the option of scaling later by provisioning new data nodes and relocating some shards. Generally, from what I have read, network-based storage is not recommended with ES for performance reasons: network storage is slower than locally attached disks. You need replicas so that if a node failure takes some primary shards down, the replicas can still respond to requests. Redundant disks in the SAN protect against a disk failure, not against a node failure.

  3. Is there a specific formula for calculating the combination of shards/indices/retention strategy?

There isn't any simple formula; it depends on the use case. Some simple considerations:
i. The optimal shard size is 10-40 GB. Try to keep each shard below the 40 GB limit.
ii. Keep to at most about 25 shards per 1 GB of JVM heap on your ES data nodes. This is an upper limit, not a target.
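
The two rules of thumb above can be turned into a quick back-of-the-envelope calculation. The sketch below is only an illustration: the function names and the 30 GB target shard size (picked from inside the 10-40 GB range) are assumptions, not anything defined by Elasticsearch or Graylog.

```python
import math


def estimate_primary_shards(index_size_gb, target_shard_gb=30.0):
    """Smallest shard count that keeps each shard at or below target_shard_gb.

    target_shard_gb=30 is an assumed value inside the 10-40 GB range.
    """
    if index_size_gb <= 0:
        return 1
    return max(1, math.ceil(index_size_gb / target_shard_gb))


def shard_budget(memory_gb, shards_per_gb=25):
    """Upper bound on the shards one data node should hold (rule ii)."""
    return int(memory_gb * shards_per_gb)


shards = estimate_primary_shards(100)
print(f"{shards} shards of {100 / shards} GB each")  # 4 shards of 25.0 GB each
print(shard_budget(8))  # 200
```

The budget is a ceiling for the whole node, counting primaries and replicas of every index it hosts, so it is worth checking against the number of indices you plan to retain.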

Hope this helps you out.


(George) #3

Hello, thanks for your detailed answer.
Just two points to follow up on:

  1. Is it possible for a node to have the Master + Data roles simultaneously?
    According to your plan, a single data node will host all 3 shards and the remaining 2 nodes will have no shards. Am I correct?

  2. How do you define the shard size?

Thanks in advance!


(Junaid) #4

  1. Is it possible for a node to have the Master + Data roles simultaneously?

Yes, you can set both node.data and node.master to true for that.

According to your plan, a single data node will host all 3 shards and the remaining 2 nodes will have no shards. Am I correct?

Two data nodes will be hosting shards. There will be a total of 3 + 3 = 6 shards (3 primaries and 3 replicas), since you are setting number_of_replicas to 1. There are 2 Data + Master-eligible nodes. These will host the shards (3 on each node), whereas the 3rd one (Master only) will not have any shard allocations.
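
To make the 3 + 3 = 6 arithmetic concrete, here is a small sketch of one possible layout. It is not Elasticsearch's real allocator, just a round-robin illustration that respects the one rule ES does enforce here: a replica is never placed on the same node as its primary. The function name and the node names are made up for the example.

```python
def place_shards(primaries, replicas, data_nodes):
    """Round-robin layout: copy c of shard s goes to node (s + c) % n.

    Copies of one shard land on different nodes as long as
    replicas < len(data_nodes); otherwise the layout is impossible.
    """
    n = len(data_nodes)
    if replicas >= n:
        raise ValueError("not enough data nodes to separate all copies")
    layout = {node: [] for node in data_nodes}
    for shard in range(primaries):
        for copy in range(replicas + 1):
            kind = "P" if copy == 0 else "R"  # P = primary, R = replica
            layout[data_nodes[(shard + copy) % n]].append(f"{kind}{shard}")
    return layout


# 3 primaries, number_of_replicas = 1, two Master + Data nodes.
layout = place_shards(3, 1, ["node-1", "node-2"])
print(layout)
# {'node-1': ['P0', 'R1', 'P2'], 'node-2': ['R0', 'P1', 'R2']}
print(sum(len(s) for s in layout.values()))  # 6
```

Each data node ends up holding a full copy of the index, which is why losing one of them does not make any shard unavailable.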

  2. How do you define the shard size?

By shard size, are you referring to the number of shards in an index? We define the number of shards in an index when creating the index, or by using index templates. You need to estimate the size when you create it: how much data do you expect it to hold? If you think an index will have a size of around 100 GB, it would be better to allocate 3 or 4 shards. It is better to over-allocate by an extra shard when creating the index; if you want to reduce the count, that can be done later.
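
As a rough illustration of the 100 GB example, the sketch below builds the settings body that would accompany an index-creation request or sit in an index template. number_of_shards and number_of_replicas are the real Elasticsearch setting names; the helper function and the 30 GB target shard size are assumptions for the example. In a Graylog setup these values are normally set in Graylog's own configuration rather than sent by hand, so treat this as a way to see the numbers, not as a deployment step.

```python
import json
import math


def index_settings(expected_size_gb, target_shard_gb=30.0, replicas=1):
    """Build an index settings body sized for the expected data volume.

    target_shard_gb=30 is an assumed target shard size, not an ES default.
    """
    shards = max(1, math.ceil(expected_size_gb / target_shard_gb))
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
            }
        }
    }


body = index_settings(100)  # ~100 GB index -> 4 primary shards
print(json.dumps(body, indent=2))
```

With a 100 GB estimate this yields 4 primaries, at the upper end of the "3 or 4" suggested above, which leaves some headroom if the estimate turns out low.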