I have a few questions about defining cluster topology:
I have read that the best configuration is one shard per index per node. How do you deal with this recommendation when you have daily or monthly indices?
While trying to work out the number of nodes and shards per index for our use cases, I am curious about yours:
Do you build one cluster per use case (e.g. a web app search engine over a slowly growing document set, or high-volume log indexing such as SharePoint's logs, ...)?
Or do you add nodes to your cluster as new use cases come along?
If you are using daily indexes then each day is a separate new index. The one shard per index per node rule would apply to each daily index, not to all of them together.
There is a good ElasticON video about sizing that you may want to watch...
Just to be sure I haven't misunderstood the rule: the best situation is to have only one shard per index on each node?
I know that time-based indices create a new index each day or month, but if I apply this rule strictly, won't my node count grow very fast? Am I wrong?
the best situation is to have only one shard per index on each node
Well, yes, in the sense that you can be sure all the power available on that machine will only be used by your single shard.
But that's a waste of resources in most cases. Fortunately, a single node can hold much more than a single shard.
Depending on what you are doing, you can start with a number of shards equal to the number of cores you have (as a rule of thumb), then increase that number and see whether you still meet your requirements.
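To make that concrete, here is a minimal sketch using the official Python client (elasticsearch-py); the index name, shard count and replica count are purely illustrative, so benchmark with your own data before settling on numbers:

```python
# Sketch: create a daily index with an explicit shard count.
# The name and the numbers below are only examples.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a cluster reachable on localhost:9200

es.indices.create(
    index="logs-2016.05.01",
    body={
        "settings": {
            "number_of_shards": 4,      # e.g. start near the core count of a node
            "number_of_replicas": 1,
        }
    },
)
```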
Let's put it a different way... For each index, you would have no more than one shard per node. However, multiple indices can use each node.
For example, if your index is configured with 2 shards and 1 replica (typical of time-based use cases like logs), you would have a total of four shards: two primaries and a replica of each. According to the rule you would want four nodes. Shards would be laid out something like this...
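Something like the following, where the node and index names are purely illustrative and two daily indices are shown (p = primary shard, r = replica shard):

```text
node1:  logs-05-01 p0    logs-05-02 p0
node2:  logs-05-01 p1    logs-05-02 p1
node3:  logs-05-01 r0    logs-05-02 r0
node4:  logs-05-01 r1    logs-05-02 r1
```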
This is simplified a bit, as Elasticsearch won't actually put all the primaries on the same nodes, but the main point is that multiple indices use each node, while each node holds only a single shard of any given index.
I am not so sure I agree with David that matching shards to cores is a good rule of thumb. In my experience (which I am sure is less than David's, so I am open to him correcting me) you will be limited by disk I/O long before you are limited by CPU. The main reason to spread shards across multiple servers is to spread the disk I/O load across those servers. Similarly, the main reason to give your servers lots of RAM is to hold as much of the working set in cache as possible, further avoiding disk I/O. The one shard per index per node rule basically says that you want to run your queries as much in parallel as possible, spreading the disk I/O load as widely as possible and maximizing throughput.
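If it helps, a quick way to check how your shards actually end up spread across nodes is the _cat/shards API. A small sketch with the Python client (the logs-* index pattern is just an example):

```python
# Sketch: list every shard with its index, primary/replica flag and the node
# it lives on, to verify that shards (and their I/O) are spread across nodes.
from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a cluster reachable on localhost:9200

print(es.cat.shards(index="logs-*", v=True, h="index,shard,prirep,state,node"))
```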