Why use checksums to verify segment files?

When I restart the cluster, I find that many replica copies are recovered from scratch because the checksums of their segment files are inconsistent. I then wrote a demo using the Lucene API, and the result is the same: identical operations produce inconsistent checksums.
So why use checksums to verify the consistency of Lucene files? Doesn't this cause all segment files to be copied every time a replica recovers?
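Here is roughly what my demo does (the class name, directory paths, field name, and document contents are placeholders): it indexes the same documents, in the same order, into two fresh directories and compares the footer checksums of the resulting _0.cfs files.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

import java.nio.file.Paths;

public class SameOpsDifferentChecksums {

    // Index the same documents, in the same order, into the given directory.
    static void indexIdenticalDocs(Directory dir) throws Exception {
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            for (int i = 0; i < 100; i++) {
                Document doc = new Document();
                doc.add(new TextField("body", "identical document " + i, Field.Store.YES));
                writer.addDocument(doc);
            }
            writer.commit();
        }
    }

    // Read the CRC32 checksum stored in the footer of a segment file.
    static long footerChecksum(Directory dir, String fileName) throws Exception {
        try (IndexInput in = dir.openInput(fileName, IOContext.READONCE)) {
            return CodecUtil.retrieveChecksum(in);
        }
    }

    public static void main(String[] args) throws Exception {
        try (Directory a = FSDirectory.open(Paths.get("/tmp/index-a"));
             Directory b = FSDirectory.open(Paths.get("/tmp/index-b"))) {
            indexIdenticalDocs(a);
            indexIdenticalDocs(b);
            // The file name _0.cfs assumes the default compound file format.
            // The two checksums usually differ even though the same documents were
            // indexed in the same order (my guess is that the per-segment file
            // headers contain run-specific metadata such as a unique id).
            System.out.println("index-a _0.cfs checksum: " + footerChecksum(a, "_0.cfs"));
            System.out.println("index-b _0.cfs checksum: " + footerChecksum(b, "_0.cfs"));
        }
    }
}
```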

Because it is vital that the data read from disk is the data that was previously written. A checksum mismatch indicates that some of the data has been changed since it was written, which means it cannot be trusted. The usual reason for this is faulty storage hardware.
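To make this concrete, here is a minimal sketch of such a local integrity check using Lucene's CodecUtil (the path and file name are placeholders): it reads the whole file, recomputes the CRC32, and compares it with the value stored in the file's footer, which is essentially the check that fails when the hardware has corrupted the data.

```java
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.CorruptIndexException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

import java.nio.file.Paths;

public class VerifySegmentFile {
    public static void main(String[] args) throws Exception {
        // Path and file name are placeholders.
        try (Directory dir = FSDirectory.open(Paths.get("/path/to/shard/index"));
             IndexInput in = dir.openInput("_0.cfs", IOContext.READONCE)) {
            // Recomputes the CRC32 over the whole file and compares it with the value
            // stored in the footer, throwing CorruptIndexException on a mismatch.
            long checksum = CodecUtil.checksumEntireFile(in);
            System.out.println("file is intact, checksum=" + checksum);
        } catch (CorruptIndexException e) {
            System.out.println("the bytes on disk are not what was written: " + e.getMessage());
        }
    }
}
```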

But I found that identical operations can still produce different checksums. Isn't relying on the checksum alone too simplistic?
In addition, the sync_id cannot guarantee that the primary shard and its replicas are consistent when the cluster is restarted, because after the primary shard's sync_id is updated, the replicas may not yet have been recovered, so their sync_id has not been updated in step with the primary.

Sorry, I did not really understand your question because it seems to be confusing a number of distinct concepts. Checksums are used to verify segment files (and other things) locally to protect against corruption. But you seem to be asking about peer recovery too. Checksums are verified during peer recovery (also to protect against corruption) but this is unrelated to the sync id used to detect whether any recovery is needed.

On closer reading I think you are actually asking about replica allocation after a restart. We recently merged a big improvement to how replicas are allocated after a restart in #46959 to compare the contents of shard copies rather than using the checksums of the underlying segment files. Is this what you are asking about?

Thank you for your reply! Actually, I have two questions:

  1. After a restart, when allocating replica shards, the checksums are compared to decide whether the current node is the best-matching node. I found that a replica shard was assigned to a new node because this comparison failed (it shows that the checksums of segment files produced by the same operations are inconsistent, so is it appropriate to use checksums in this case?).
  2. Phase 1 of peer recovery checks again whether the checksums are consistent. If I want to skip phase 1, I have to make sure the sync_id is consistent (as I said above, it is hard to ensure the sync_ids are consistent; see the sketch after this list).
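As far as I understand, to line the sync_ids up one has to stop indexing and issue a synced flush on the relevant indices before restarting; a rough sketch with the low-level REST client (the host, port, and index name are just placeholders, and the synced flush API only exists in these older versions):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SyncedFlushBeforeRestart {
    public static void main(String[] args) throws Exception {
        // Host, port and index name are placeholders.
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // A synced flush writes a sync_id into the commit of each idle shard copy,
            // so that copies with the same sync_id can skip phase 1 of peer recovery.
            Response response = client.performRequest(new Request("POST", "/my-index/_flush/synced"));
            // The response body reports, per shard, whether a sync id was written.
            System.out.println(response.getStatusLine());
        }
    }
}
```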

Thanks again!

In addition, I have seen the optimization in the new version, but the version I am using does not include that change. Therefore, I want to understand the purpose of using checksums to find the best-matching node in the previous versions.

Thank you

Sorry, it is unclear what you're asking. You have (fairly accurately) described how things work in older versions, but there aren't really any questions to answer there.

Looking for identical segments is easy and fast and reliable and often quite accurate too: it's often highly likely that older indices will have identical segments since they will have been subject to a file-based recovery at some point in the past. In contrast, it took many many years of effort to build the groundwork needed to implement #46959.
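To illustrate the kind of check involved: in those versions a file on a replica only counts as reusable if its name, length, and checksum all match the primary's copy. Here is a rough sketch of that comparison using plain Lucene APIs (the directory paths are placeholders, and the real implementation lives inside Elasticsearch's store metadata code rather than in user code like this):

```java
import org.apache.lucene.codecs.CodecUtil;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

import java.io.IOException;
import java.nio.file.Paths;

public class CompareShardCopies {

    // A file is treated as identical only if it exists on both copies with the
    // same length and the same footer checksum.
    static boolean identical(Directory a, Directory b, String name) {
        try (IndexInput ia = a.openInput(name, IOContext.READONCE);
             IndexInput ib = b.openInput(name, IOContext.READONCE)) {
            return a.fileLength(name) == b.fileLength(name)
                && CodecUtil.retrieveChecksum(ia) == CodecUtil.retrieveChecksum(ib);
        } catch (IOException e) {
            return false; // missing or unreadable on one of the copies
        }
    }

    public static void main(String[] args) throws Exception {
        try (Directory primary = FSDirectory.open(Paths.get("/path/to/primary/index"));
             Directory replica = FSDirectory.open(Paths.get("/path/to/replica/index"))) {
            // Walk the files referenced by the primary's latest commit point.
            for (String name : SegmentInfos.readLatestCommit(primary).files(true)) {
                System.out.println(name + (identical(primary, replica, name)
                        ? ": identical, can be reused"
                        : ": different or missing, must be copied"));
            }
        }
    }
}
```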

In other words, checksums do not reliably identify segment files that hold the same content. Therefore, isn't it too strict to use checksums to judge whether segment files are consistent in the older versions?

For example, even when the contents of my segment files are logically the same, I still cannot guarantee that their checksums are the same.

I performed the same operations on the same index using the Lucene API, then compared the resulting _0.cfs files and found that the checksums of the two files are not the same.

Maybe this is simply a necessary step in the evolution of the versions.

Thank you for your patience!