I know that this has been discussed earlier, but not all the advantages were covered in the earlier post. Also, that discussion happened quite a while ago, and I would like to understand whether you still feel these trade-offs are not worth making in favour of better performance.
I am thinking that this could be added as a new replication type instead of changing the existing logical replication. Here are a few points in addition to the ones already discussed in the old post:
- Refresh delay and coordination: We don't need to wait for replicas to finish the refresh before opening readers on the primary. Just sending a notification to start the refresh on replicas should be enough. We could even expose this as a configuration so users can choose based on their consistency requirements (see the settings sketch after this list).
- Cross-cluster replication: Instead of piggybacking on logical replication, cross-cluster replication could be done via this new replication model. You don't need to read each operation in the source cluster and write each operation in the destination cluster; all you need to do is transfer the segment data across the clusters. Plus, the additional data transfer cost may be comparable in most cases, since you no longer ship uncompressed operation-level data. Operation-level data may turn out to be much larger than compressed segment data, even after including the merge cost.
- Unpredictable merge data transfer costs: Optimizations in more recent Lucene versions around compacting segments during refresh and flush could help here, as the segments created would be larger. Despite this, merge data transfer costs remain unpredictable, but so do merge compute and IO costs. As Mike already mentioned in his blog, trading off network would turn out to be better in most scenarios. There should at least be an option for customers to select the replication model if it provides a benefit in most scenarios. Also, we could build support for doing merges on replicas by writing some sort of merge operation log on replicas instead of sending over entire merged segments (see the merge-log sketch after this list). There are other ways of achieving this too.
- Read-only replicas: This replication model gives users the ability to scale reads independently of writes. The read-only copies don't participate in failovers but continue to serve reads.
- Primary becoming a choke point for all replicas: In the current model, the primary is responsible for moving data to all replicas. In the new model, the primary could just announce a new commit point to replicas, which could then copy the data from their peers, acting as repeaters (see the checkpoint sketch after this list). Hence, you end up utilizing network bandwidth across all nodes and not just a single node.
- Performing indexing with 0 replicas as an alternative: This may be okay for some users, but it could lead to data loss and hence may not work for users who want their store to provide durability out of the box.
- Trading off CPU for network may not work everywhere: Because we would provide this as a new replication type, users can choose which trade-offs to make depending on their hardware, instead of being forced to use a less performant replication model.
- Failovers and rebalancing: There are two complexities with failover in the new model that the logical replication model solves elegantly by design. First, the replica may not be up to date, and local recovery from the translog (copied during indexing) could take some time. The time to recover locally from the translog depends on the refresh interval; hence, refresh determines not only the freshness of your data but also the failover time, which is a new trade-off the user needs to understand. This problem does not exist with read-only replicas and cross-cluster replication, though. The second problem is the sudden jump in a replica's compute and memory footprint on its node when it becomes primary, creating imbalance in the cluster. I agree this problem is tricky to solve, and the shard allocation logic may become complex. One way of solving it is assigning different weights to primaries and replicas and rebalancing when a replica becomes primary (see the weighting sketch after this list); once the old primary rejoins, the promoted replica should be demoted and the original roles restored. We could potentially delegate the rebalancing to users as well, by allowing them to configure rollover rules based on replica/primary imbalance wherever applicable, e.g. for log use cases. A side effect of primary and replica not being identical is lower static availability of the system during node failures, as the system may not be scaled enough to absorb them. Durability is still guaranteed, though.
- Mixed version clusters: Any logical replication would need some physical replication, as history cannot be preserved forever. ES does not support version compatibility for nodes that can't read data written by older Lucene versions; if it did, the failure scenarios would trigger physical replication even in the existing model. Maybe I did not understand this point well in the older post. Can you explain what you meant by the challenges in mixed version clusters?
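To make a few of the points above concrete, here are some rough sketches. First, the configuration angle: a minimal sketch of what opting into the new replication type could look like. The `index.replication.*` keys are hypothetical names I made up for illustration; only `Settings.builder()` and the standard `number_of_replicas`/`refresh_interval` settings are existing ES APIs.

```java
import org.elasticsearch.common.settings.Settings;

class SegmentReplicationSettingsSketch {
    // Sketch of hypothetical index settings for the proposed replication
    // type. The index.replication.* keys do not exist in ES today; they
    // are invented here purely for illustration.
    static Settings buildIndexSettings() {
        return Settings.builder()
                .put("index.number_of_replicas", 1)
                .put("index.refresh_interval", "60s") // also bounds translog replay on failover
                .put("index.replication.type", "segment")               // hypothetical
                .put("index.replication.refresh_coordination", "async") // hypothetical: notify
                                                                        // replicas, don't wait
                .build();
    }
}
```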
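For the merge operation log idea, here is a rough sketch of what a replayable merge record could look like. Everything here is hypothetical; it assumes a replica can reproduce a merge deterministically from the same source segments, merge policy and Lucene version, which would need verification.

```java
import java.util.List;

// Hypothetical entry in a merge operation log. Instead of shipping the
// merged segment over the network, the primary logs which segments were
// merged into which target, and a replica that already holds the source
// segments replays the merge locally. The checksum lets the replica
// verify that its replayed merge matches the primary's output.
record MergeOp(
        long sequence,               // merges must be applied in this order
        List<String> sourceSegments, // segment names consumed by the merge
        String targetSegment,        // segment name the merge must produce
        long expectedChecksum        // guards against non-deterministic merge output
) {}
```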
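For the repeater idea, here is what the primary's announcement could carry. The point is that it is metadata only: the primary broadcasts the commit point, and each replica pulls the files it is missing from any peer that already has them, so upload bandwidth is spread across the cluster. All names are again hypothetical.

```java
import java.util.Map;

// Hypothetical checkpoint message broadcast by the primary after a
// refresh/commit. It carries only file metadata, not segment bytes;
// each replica then copies missing files from whichever peer (primary
// or another replica) already has them.
record CheckpointNotification(
        String index,
        long primaryTerm,       // fences messages from stale primaries
        long generation,        // commit point generation on the primary
        Map<String, Long> files // segment file name -> length, for planning the copy
) {}
```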
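And for the rebalancing weights, the idea is just arithmetic; a sketch with made-up weights:

```java
class AllocationWeightSketch {
    // Count a primary as heavier than a replica so the balancer keeps
    // headroom on nodes hosting primaries. The weights (1.0 vs 0.4 in
    // the example below) are made-up illustrations, not measured values.
    static double nodeLoad(int primaries, int replicas,
                           double primaryWeight, double replicaWeight) {
        return primaries * primaryWeight + replicas * replicaWeight;
    }
    // e.g. nodeLoad(2, 1, 1.0, 0.4) = 2.4 while nodeLoad(1, 3, 1.0, 0.4) = 2.2;
    // when a replica gets promoted, its node's load jumps by 1.0 - 0.4 = 0.6,
    // which is exactly what should trigger a rebalance.
}
```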
PS: I did a rough PoC on ES version 7.4 on two i3.8xlarge machines with 1 index (1 primary and 1 replica). The throughput improvement with a read-only replica was 110%, while with a failover replica it was close to 80%. The refresh interval was set to 60 seconds. The additional network copy was 10% higher than with logical replication. The dataset I used was similar to a typical metrics use case, with lots of numbers and a few keywords. The refresh lag was close to 1s between primary and replica. Client-side request latency also reduced considerably. I can share more details if needed.