Cluster topology consideration

Hi all,

I have some questions about defining cluster topology:

  1. I have read that the best configuration is one shard per index per node. How do you follow this recommendation when using daily or monthly indices?

  2. To work out the number of nodes and shards per index for our use cases, I'd like to hear about yours:
    Do you build one cluster per use case (e.g. a web-app search engine over a slowly growing document set, heavy log indexing such as SharePoint logs, ...)?
    Or do you add nodes to your cluster as new use cases come along?

If you are using daily indexes then each day is a separate new index. The one-shard-per-index-per-node rule applies to each daily index, not to all of them together.

There is a good ElasticON video about sizing that you may want to watch...

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing

Rob

Thank you Robert

I'll have a look at the video.

Just to be sure I haven't misunderstood the rule: is the best situation to have only one shard per index on each node?

I know that time-based indices will create a new index each day or month, but if I want to strictly apply this rule, my node count will grow very fast. Am I wrong?

> the best situation is to have on each node only one shard per index

Well, yes, in the sense that you can be sure all the power available on that machine will be used by your single shard alone.
But that's a waste of resources in most cases. A single node can fortunately hold much more than one shard.
Depending on what you are doing, you can start with a number of shards equal to the number of cores you have (a rule of thumb), then increase that number and see if you still meet your requirements.
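As a rough illustration of that rule of thumb (the helper below is hypothetical, not part of any Elasticsearch client API):

```python
import os

# Hypothetical helper, not an Elasticsearch API: pick an initial
# primary-shard count equal to the CPU cores on a node (the rule of
# thumb described above), then tune up or down based on measurements.
def starting_primary_shards(cores_per_node=None):
    cores = cores_per_node if cores_per_node is not None else os.cpu_count()
    return max(1, cores)

print(starting_primary_shards(8))  # a node with 8 cores -> start with 8 shards
```

The point is only to pick a defensible starting value; the real answer comes from testing against your own requirements.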

Let's say it a different way... For each index, you would have no more than one shard per node. However, multiple indexes can share each node.

For example, if your index is configured for 2 shards and 1 replica (typical of time-based use cases like logs), you would have a total of four shards: two primaries and a replica of each. According to the rule you would want four nodes. Shards would be laid out like this...

Node 1 | Node 2 | Node 3 | Node 4
------ | ------ | ------ | ------
Mon P1 | Mon P2 | Mon R1 | Mon R2
Tue P1 | Tue P2 | Tue R1 | Tue R2
Wed P1 | Wed P2 | Wed R1 | Wed R2

This is simplified a bit, as Elasticsearch won't put all primaries on the same nodes, but the main point stands: multiple indexes use each node, yet each index has only a single shard per node.
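The layout above can be sketched in a few lines of Python. This is a toy illustration of the one-shard-per-index-per-node idea, not how Elasticsearch's allocator actually works (as noted, the real allocator wouldn't put all primaries on the same nodes):

```python
# Toy sketch: place 2 primaries + 1 replica each for three daily
# indices across 4 nodes, one shard copy per index per node,
# mirroring the simplified table above.
def layout(indices, shards_per_index=2, replicas=1, nodes=4):
    """Return {node_number: [shard labels]} for each node."""
    copies = shards_per_index * (1 + replicas)  # shard copies per index
    assert copies <= nodes, "need at least one node per shard copy"
    placement = {n: [] for n in range(1, nodes + 1)}
    for index in indices:
        labels = [f"{index} P{i}" for i in range(1, shards_per_index + 1)]
        labels += [f"{index} R{i}" for i in range(1, shards_per_index + 1)]
        for node, label in zip(placement, labels):
            placement[node].append(label)
    return placement

print(layout(["Mon", "Tue", "Wed"]))
# Node 1 ends up with Mon P1, Tue P1, Wed P1 -- three shards,
# but only one shard from any given index.
```

Each node carries one shard per daily index, so a query over any single day fans out across all four nodes in parallel.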

I am not so sure that I agree with David that matching shards to cores is a good rule of thumb. In my experience (which I am sure is less than David's, so I am open to him correcting me), you will be limited by disk I/O long before you are limited by CPU. The main reason to spread shards across multiple servers is to spread the disk I/O load across those servers. Similarly, the main reason to give your servers lots of RAM is to hold as much of the working set in cache as possible, further avoiding disk I/O. The one-shard-per-index-per-node rule is basically saying that you want to run your queries with as much parallelism as possible, spreading the disk I/O load and maximizing throughput.

Hopefully that helps.

Rob


Thank you Robert for this great explanation. It is perfectly clear.

Thank you also David. I do not intend to strictly follow the "one shard per node per index" rule, but it helps me size the cluster.

Just one more question if I may: is there any recommendation about the ratio between memory and disk size for cluster nodes?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.