A new replication type: physical replication

djjsindy · November 19, 2018, 8:36am

I read this blog a few months ago ：http://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html  ,There is indeed a very core case in my work that can be used for this function.I developed the function of physical replication on elasticsearch （segment replication）.
**In the recent Alibaba double 11 carnival, my core cluster turned on this physical replication, and cluster write performance throughput increased by more than 40%.**
**This function I want to submit it to elastic, now I will share some of my ideas with this feature.**
 **Do you think this function is feasible? Do you have any suggestions for this design?**

Experienced analysis
The existing default replication architecture is a logical replication. After the primary replica index is created, you need to create the index again in the replica shard. You can try to copy the data after the primary creation of the index to the replica shard. The index can be created once, and both the primary and replica can be used. Reduced write costs, and improved write performance .
We found that Lucene's author, Michael McCandless, made a simple framework for physical replication, Lucene-replicator, in Lucene 4.x. In 2017.9, He wrote an article about [lucene physical replication](http://blog.mikemccandless.com/ 2017/09/lucenes-near-real-time-segment-index.html) , this framework is based on lucene's segment copy. His opinion is that the primary and replica segments must be consistent, the primary to do merge, replica do not merge, replica will completely copy the primary segment.
We can also see from the source code and Michael's blog that the framework is most concerned with the details of the primary-replica visibility delay. This framework describes how to ensure the minimum latency of the primary-replica replication:
1.1 The primary shard refresh segment, you can copy the segment of the memory to ensure the delay of the visibility of the primary and replica. The specific approach is to first pass the full amount of meta to the replica, and prepare the segment to be copied according to the meta and local segment diff. Finally, copy the diff of these segments.
1.2 The primary merge comes out with a large segment, and the primary merge thread is kept through the block, and the big segment is not in the segment_x file until the large segment copy to replica , and the large segment information is written into the segment_x. This will affect the delay of segment replication (there is already a large file already exists, you do not need to spend time copying again)
1.3 According to the above, the copy segment is divided into segment copy (real-time write) generated by refresh and segment copy (history data merge) generated by merge.
My job
I designed a physical replication framework for elasticsearch In my own version. I design the overall architecture from the following points:
2.1 Reduce write CPU usage (boost write throughput)
2.1.1 The write process of the primary shard is unchanged. In order to ensure data consistency, the replica shard only needs to write translog, and no need to parse mapping and write lucene.
2.2 Segment fast incremental replication ( real-time incremental segment replication)
2.2.1 The primary shard periodically generates a segment in the memory (through the refresh operation), and primary shard sends the full amount of meta to the replica, and obtains the segment list according to the local segment diff meta increment. After a round of copying is completed, the prepared reader is refreshed, causing the reader to load the latest segment.
2.2.2 Isolation of refresh segment send and merge segment send, using different thread pools for isolation.
2.3 primary and replica data can be seen with low latency (merge historical segment replication)
2.3.1 The use of IndexReaderWarmer for segment precopy allows the primary segment to be copied to the replica shard before the primary shard reader can read the large segment. After the copy is completed, the primary merge is finally completed, so that when the primary reads the large segment. At that time, in fact, this segment already exists in the replica shard , and the time for copying this large segment is saved, which greatly reduces the delay of the visibility of the primary and replica data.
2.4 keep data consistency
2.4.1 elasticsearch can perform primary/replica switchover when the original primary shard is abnormal. In physical replication mode, when the primary/replica switchover, the replica segment is behind the data state in the primary memory. Before switching to the primary, you need to perform fast translog replay, so that the new primary data is consistent with the state in the translog. This process requires fast and reduces the RPO time.

DavidTurner · November 19, 2018, 9:37am

Mike's idea is certainly feasible, but is a very substantial change to Elasticsearch with different performance characteristics (not all of them better). The idea is something we've discussed a few times internally already, but I've raised this post for internal discussion too.

djjsindy · November 19, 2018, 9:49am

thanks david.
I did talk to mike in the email, I put the physical replication in our version of elasticsearch and verified it in our core business. Mike saids "Wow, this is truly incredible progress! "
I really want to know in what case, physical replication will have different effects, and even lead to performance degradation.
I really want to contribute this feature to the community, and I look forward to further communication with you.

DavidTurner · November 19, 2018, 12:50pm

For instance I'd expect refreshes to take longer, because we have to build a segment on the primary and then copy it to each replica, rather than building segments concurrently on every shard copy. I'd also expect replication to take more bandwidth because each document must be replicated once to the remote translog (for durability) and then a second time as part of the segment in which it first appears, and then every subsequent time it appears in a merged segment too. I'm not saying these are necessarily bad tradeoffs to make, but they are tradeoffs.

djjsindy · November 19, 2018, 2:22pm

You want to refreshes to take longer, Physical replication is more suitable for this case, the primary build index, and then replica copy this index. Physical replication does have more trade off . It is a way to convert cpu consumption to network consumption. I also asked mike this question. He saids ”Fast networking interconnect (10g and faster) has been getting cheaper and cheaper with time so copying even large merged segments may not be too slow“ . Indeed, in my case, the first bottleneck is disk io, not network bandwidth . Under large write traffic, the network occupies 400MB/s (Physical network speed is above 10G), and the disk io may exceed 800MB/s.

DavidTurner · November 21, 2018, 3:51pm

We discussed this as a team again today. The performance points raised above are significant:

Elasticsearch is deployed in a very wide variety of environments and trading off CPU against bandwidth is not going to work everywhere. For instance inter-AZ traffic on AWS carries a financial cost, so the extra bandwidth consumed by segment-based replication would add to the costs of running an Elasticsearch cluster there that may well not be offset by the CPU savings. Furthermore those extra costs are very hard to predict because merges are not very predictable, and merges can now have a cluster-wide impact on performance rather than these effects being isolated to each node.
The extra time it adds to refreshes would not be acceptable to some users.

Some other points were also raised:

The allocation of primaries would have to be balanced across the cluster, because primaries would be doing more work than the replicas. Today there is no such balancing algorithm. Additionally, for this balancing we would need to be able to demote a primary back to being a replica which is not possible to do gracefully today.
Because different versions of Elasticsearch use different Lucene versions, it would be challenging to support mixed-version clusters if the segments were being replicated.
There are some advantages to refreshes occurring in a more coordinated fashion, but it would also add quite some complexity.
One can achieve a similar sort of load profile by performing indexing with number_of_replicas: 0 and then adding replicas once indexing is complete, because the recovery of a brand-new replica operates mostly on the segments themselves.

It was quite a long and interesting discussion, and a number of advantages of segment-based replication were identified too, but on balance the conclusion was that this wasn't a path we expect to follow in the foreseeable future.

djjsindy · November 29, 2018, 11:29am

We have indeed considered these trade-offs:

We think that in the intenal-az case, the internal network is not expensive. On the contrary, if the public network is expensive, the price will be much more expensive.
For merge replication, it is possible to use replication speed limit (similar to recovery copy limit) to prevent network bandwidth from being uncontrollable.
In the long term, the price of the cpu should be significantly higher than the price of the intranet bandwidth.

in contrast:

The strategy of the primary and replica allocations is to be modified.
Segment replication is not suitable for long-term lucene versions in the cluster.

Based on the scenario analysis we have used online for physical replication, physical replication does have a significant advantage in terms of write performance requirements and high availability.

adichad · November 30, 2018, 11:53pm

"because the recovery of a brand-new replica operates mostly on the segments themselves"

although in practice on a cluster serving live traffic all the time, setting number-of-replicas: 0 and reindexing all data is not an option, your statement exposes the fact that what is being asked may actually be somewhat low-hanging fruit. especially since i suspect recovery of lost/stale replicas is done in a similar manner since the translog is truncated every few seconds everywhere anyway.

this existing behaviour (the translog is truncated periodically and then new replicas created by copying segments up to the last commit and then "suffix" translog is replayed, if i understand correctly) can actually be leveraged to introduce a per-index "index.replication-mode" or similar setting, defaulting to the current behaviour, but allowing another mode called, say, "copy-segments-and-translog". we shall copy the translog only to retain the promotability of a replica to primary status in case the primary shard fails. this trades off freshness (nearness to real-time) to save potentially a lot of cpu-cycles on replicas. with adaptive replica selection tweaked a bit, this could be a very strong cost argument for many deployments.

sorry to mention the competition but you probably already know about this: https://sematext.com/blog/solr-7-new-replica-types/

just saying that the ability to trade-off freshness/bandwidth to save CPU could benefit many bottom lines.

djjsindy · December 1, 2018, 7:46am

This replication mode is completely different from solr7 and does not depend on the commit point.

We only need to rely on the primary shard refresh，send the segment infos meta in memory to the replica shard. Replica shard diff local file and primary shard meta get the list of files needed from the primary copy.These are asynchronous requests and will not affect the refresh time. There is a problem with logical replication. Since the refresh time is randomly uncontrollable, the time when the primary and the replica see the data is uncontrollable（It may happen that you see the latest status first, then the historical status, and finally the latest status.）.Physical replication ensures that replica the data is seen after the primary, and the time is controllable.From the actual situation, under the daily pressure of our core scene, the primary and the replica see the data delay within 10ms. Under the peak pressure, the average delay is within 50ms, and the maximum delay is 200ms, so this time is lower than the delay of Solr7 at least 2 orders of magnitude.

The following is the delay diagram at 2018-11-11 00:00:00. The ordinate unit of this graph is milliseconds

These have nothing to do with the commit, and there is no very low periodic truncation of the translog.

Physical replication does not abandon the checkpoint system. Instead, it is a sublimation of checkpoint on physical replication.

In the physical replication, the existing checkpoint is only the position of the indicated translog, and does not indicate the location of the data. When the replica shard is used for recovery, this checkpoint is used to copy the translog from the primary shard.

Physical replication in the primary shard uses the local checkpoint in memory to indicate the progress of the segment asynchronous replication.

adichad · December 6, 2018, 8:16am

sounds very interesting for me and a few others. is it possible to share your code?

system · January 3, 2019, 8:16am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.