How to achieve fault tolerance

I have an 8-node ELK cluster: 3 dedicated data nodes, 3 dedicated master nodes, and 2 HTTP (Kibana) nodes. Indices are currently created daily with a date stamp. Each one holds about 21,293,408 docs on average and is 15 to 25 GB in size. Right now number_of_shards is 1 and number_of_replicas is 1. I am trying to achieve fault tolerance (FT), so that the cluster can recover smoothly if two data nodes fail. How many shards and replicas do I need?

I have already looked at these links:
Link 1
Link 2
Link 3

They don't give me a formula or anything similar that I can use to achieve FT.

If you want to be able to handle 2 data nodes going down at once without data loss, you need the number of replicas set to 2. I am, however, not sure that you would be able to continue indexing with only one shard copy remaining, as a quorum might be required. If you add another data node you should be fine though.
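The rule in this reply can be written down as a small check. This is only a rough sketch, not an Elasticsearch API: the function name `max_tolerated_failures` and its parameters are made up for illustration. It assumes all shard copies were fully allocated before the failure, and relies on the fact that Elasticsearch does not place two copies of the same shard on one node.

```python
def max_tolerated_failures(data_nodes: int, number_of_replicas: int) -> int:
    """Return how many data nodes can fail at once without losing a shard.

    Hypothetical helper for illustration only. It assumes every copy that
    can be allocated was allocated before the nodes went down.
    """
    if data_nodes < 1 or number_of_replicas < 0:
        raise ValueError("need at least 1 data node and 0 or more replicas")
    # Each shard has 1 primary plus its replicas, but only one copy of a
    # given shard can live on a node, so extra copies stay unassigned.
    allocated_copies = min(number_of_replicas + 1, data_nodes)
    # The data survives as long as one allocated copy is left.
    return allocated_copies - 1


# Current setup from the question: 3 data nodes, 1 replica.
print(max_tolerated_failures(data_nodes=3, number_of_replicas=1))
# Suggested setup: 3 data nodes, 2 replicas.
print(max_tolerated_failures(data_nodes=3, number_of_replicas=2))
```

With the numbers from the question, the first call covers one node failure and the second covers two.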


Yes that's right, number_of_shards: 1 and number_of_replicas: 2 is all that's needed. There is no quorum system for indexing.
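As a sketch of what those two settings look like as request bodies, here they are built with Python's standard `json` module. Nothing is sent to a cluster. The variable names are made up for illustration, and where the first body goes (the `settings` section of whatever index template creates the daily indices) is an assumption about this setup.

```python
import json

# Settings for newly created daily indices. number_of_shards is fixed at
# index creation time, so it has to come from the template or the create call.
new_index_settings = {
    "index": {
        "number_of_shards": 1,
        "number_of_replicas": 2,
    }
}

# number_of_replicas is a dynamic setting, so existing indices can be
# updated in place with a body like this sent to PUT /<index>/_settings.
existing_index_update = {"index": {"number_of_replicas": 2}}

print(json.dumps(new_index_settings, indent=2))
print(json.dumps(existing_index_update, indent=2))
```

Raising the replica count on existing indices makes the cluster copy each shard once more, so expect some recovery traffic while the new replicas are built.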


Thanks both. We only lose data nodes while patching is going on, and the nodes come back online quickly after patching.

Right now I have 3 data nodes. I am going to increase that to 4, or to 5 if ELK needs an odd number. I may also need to change to 3 shards (matching one per node) and 2 replicas (which I guess will be placed on the other nodes).
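To see what the proposed 3 shards and 2 replicas would mean per daily index, here is a quick arithmetic sketch. The helper name `total_shard_copies` is hypothetical, and the per-node figure is a plain average, not a prediction of how Elasticsearch would actually balance the shards.

```python
def total_shard_copies(number_of_shards: int, number_of_replicas: int) -> int:
    """Total shard copies for one index: every primary plus its replicas."""
    return number_of_shards * (1 + number_of_replicas)


copies = total_shard_copies(number_of_shards=3, number_of_replicas=2)

# Compare the two cluster sizes mentioned in the post.
for data_nodes in (4, 5):
    average = copies / data_nodes
    print(f"{data_nodes} data nodes: {copies} copies per daily index, "
          f"about {average:.2f} per node")
```

For comparison, the 1 shard and 2 replicas setting confirmed earlier in the thread gives 3 copies per daily index and tolerates the same two node failures.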

