I read this blog a few months ago ：http://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html ,There is indeed a very core case in my work that can be used for this function.I developed the function of physical replication on elasticsearch （segment replication）.
In the recent Alibaba double 11 carnival, my core cluster turned on this physical replication, and cluster write performance throughput increased by more than 40%.
This function I want to submit it to elastic, now I will share some of my ideas with this feature.
Do you think this function is feasible? Do you have any suggestions for this design?
a. Experienced analysis
The existing default replication architecture is a logical replication. After the primary replica index is created, you need to create the index again in the replica shard. You can try to copy the data after the primary creation of the index to the replica shard. The index can be created once, and both the primary and replica can be used. Reduced write costs, and improved write performance .
We found that Lucene's author, Michael McCandless, made a simple framework for physical replication, Lucene-replicator, in Lucene 4.x. In 2017.9, He wrote an article about lucene physical replication , this framework is based on lucene's segment copy. His opinion is that the primary and replica segments must be consistent, the primary to do merge, replica do not merge, replica will completely copy the primary segment.
We can also see from the source code and Michael's blog that the framework is most concerned with the details of the primary-replica visibility delay. This framework describes how to ensure the minimum latency of the primary-replica replication:
a.a The primary shard refresh segment, you can copy the segment of the memory to ensure the delay of the visibility of the primary and replica. The specific approach is to first pass the full amount of meta to the replica, and prepare the segment to be copied according to the meta and local segment diff. Finally, copy the diff of these segments.
a.b The primary merge comes out with a large segment, and the primary merge thread is kept through the block, and the big segment is not in the segment_x file until the large segment copy to replica , and the large segment information is written into the segment_x. This will affect the delay of segment replication (there is already a large file already exists, you do not need to spend time copying again)
a.c According to the above, the copy segment is divided into segment copy (real-time write) generated by refresh and segment copy (history data merge) generated by merge.
b. My job
I designed a physical replication framework for elasticsearch In my own version. I design the overall architecture from the following points:
b.a Reduce write CPU usage (boost write throughput)
The write process of the primary shard is unchanged. In order to ensure data consistency, the replica shard only needs to write translog, and no need to parse mapping and write lucene.
b.b Segment fast incremental replication ( real-time incremental segment replication)
The primary shard periodically generates a segment in the memory (through the refresh operation), and primary shard sends the full amount of meta to the replica, and obtains the segment list according to the local segment diff meta increment. After a round of copying is completed, the prepared reader is refreshed, causing the reader to load the latest segment.
Isolation of refresh segment send and merge segment send, using different thread pools for isolation.
b.c primary and replica data can be seen with low latency (merge historical segment replication)
The use of IndexReaderWarmer for segment precopy allows the primary segment to be copied to the replica shard before the primary shard reader can read the large segment. After the copy is completed, the primary merge is finally completed, so that when the primary reads the large segment. At that time, in fact, this segment already exists in the replica shard , and the time for copying this large segment is saved, which greatly reduces the delay of the visibility of the primary and replica data.
b.d keep data consistency
elasticsearch can perform primary/replica switchover when the original primary shard is abnormal. In physical replication mode, when the primary/replica switchover, the replica segment is behind the data state in the primary memory. Before switching to the primary, you need to perform fast translog replay, so that the new primary data is consistent with the state in the translog. This process requires fast and reduces the RPO time.