Shards rebalance when adding nodes and using one primary and one replica

Hello!

I plan to use one primary shard and one replica per daily index.

If I add more nodes, will the shards automatically be rebalanced equally across all the nodes, even if I have only one primary and one replica?

I'm asking this because in this article, they say that we must "overshard" to leave room to add nodes and grow in the future:

Thanks for your feedback!

Each shard will only reside on a single node, so with one primary and one replica, two nodes will be used for each index. Different indices will, however, be spread out across the available data nodes, so all nodes will hold data.

Whether to deviate from this sharding practice will depend on how much you index per day, how many indices you are indexing into, and your query rate/patterns. Increasing the number of primary shards in order to spread out load may seem appealing, but it can result in a large number of very small shards, which is inefficient and causes a different set of problems.

If you tell us a bit more about your use case, expected data volumes, and retention requirements, we may be able to provide better guidance.

OK, thanks. So what is explained in the linked article is wrong, right? Oversharding is not really necessary, since balancing/rebalancing happens regardless of the number of primary shards per index and the number of nodes (correct me if I'm wrong).

The use case is log collection at 600 GB/day. We would like a retention of 1 month on HOT nodes and 6 months on COLD nodes.

Daily volume by data type (I'm thinking of creating one daily index per data type and rolling the firewall logs over every 50 GB using ILM; do you think that's a good idea?):

  • Firewall : 300G
  • Web : 60G
  • Unix : 30G
  • Windows : 50G
  • Switch : 10G
  • Wireless : 5G
  • Database : 4G
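As a rough back-of-the-envelope check on these numbers, here is a small Python sketch. The replica factor and retention figures come from the posts above; compression, index overhead, and growth headroom are ignored, so treat the results as approximate raw-data lower bounds:

```python
# Raw daily ingest per data type, in GB (figures from the list above).
daily_gb = {
    "firewall": 300, "web": 60, "unix": 30, "windows": 50,
    "switch": 10, "wireless": 5, "database": 4,
}

replicas = 1                   # one replica => two copies of each shard on disk
hot_days, cold_days = 30, 180  # 1 month HOT, 6 months COLD (assumption: 30-day months)

total_daily = sum(daily_gb.values())          # raw GB indexed per day
on_disk_daily = total_daily * (1 + replicas)  # primary + replica copies

hot_gb = on_disk_daily * hot_days    # on-disk footprint of the HOT tier
cold_gb = on_disk_daily * cold_days  # on-disk footprint of the COLD tier

# With a 50 GB rollover threshold, the firewall stream alone
# rolls over about 6 times per day.
firewall_rollovers = daily_gb["firewall"] / 50

print(total_daily)         # 459 GB/day raw
print(on_disk_daily)       # 918 GB/day including the replica
print(hot_gb, cold_gb)     # 27540 GB hot, 165240 GB cold
print(firewall_rollovers)  # 6.0
```

Note the asymmetry this exposes: the firewall stream is roughly 65% of the total, so sizing every index the same way would leave the smaller streams (wireless, database) with tiny daily indices.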

When using time-based indices there is rarely any need to overshard, as you can change the index settings in your index template and have them apply to new indices if you need to scale out. Recent versions of Elasticsearch also support the split index API, which can be used to split indices whose shards have ended up too large.
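For illustration, a split might look like this in the console (the index names and the target shard count here are hypothetical; the source index must be made read-only first, and the target's shard count must be a multiple of the source's):

```
# Block writes on the source index (required before splitting)
PUT /firewall-000001/_settings
{
  "index.blocks.write": true
}

# Split the source's 1 primary shard into 2 in a new index
POST /firewall-000001/_split/firewall-000001-split
{
  "settings": {
    "index.number_of_shards": 2
  }
}
```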

I would recommend setting the number of primary shards per index based on the expected data volume, and then using ILM to roll over based on actual size. There is often no need to have indices cover an exact time period, as long as the period each index covers aligns with the retention period; e.g. do not generate monthly indices if your retention period is relatively short.
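A sketch of what such an ILM policy could look like for the firewall stream described above (the policy name, the `data: cold` node attribute, and the exact phase ages are assumptions based on the volumes and retention mentioned earlier, not a tested configuration):

```
PUT _ilm/policy/firewall-logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_size": "50gb" }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": { "require": { "data": "cold" } }
        }
      },
      "delete": {
        "min_age": "210d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

With one primary shard per rollover index, a 50 GB `max_size` threshold keeps individual shards at a reasonable size while the phase ages cover the 1-month hot and 6-month cold retention (about 210 days total before deletion).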

Thanks for the recommendation, Christian.

You're talking about load, right?

Depending on the use case, it can be load and/or capacity.

Good to know. Thank you, Christian!