CCR replication paused indices

Hi Guys,

I am new to this CCR setup. CCR was working fine when we had fewer and smaller indices, but today we have many indices, and CCR had not been monitored for quite some time. We found a few paused follower indices, and after we resumed them it took more than 4 hours for them to get back in sync. I expected replication to be close to real time, but that is not happening. I need some help on how to tackle and handle replication to our DR site. We have an Enterprise license.
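As a starting point for finding and resuming paused followers, here is a minimal sketch. It assumes you have already fetched the output of the follow info API (for example GET /<index>/_ccr/info), whose entries carry follower_index and a status of "active" or "paused", and it only builds the POST /<follower_index>/_ccr/resume_follow calls rather than sending them. The index names in the sample are hypothetical. Note that resuming only lets a follower catch up if the leader still retains the operations it is missing.

```python
from __future__ import annotations

from typing import Any


def paused_followers(ccr_info: dict[str, Any]) -> list[str]:
    """Return the follower index names whose status is 'paused'."""
    return [
        entry["follower_index"]
        for entry in ccr_info.get("follower_indices", [])
        if entry.get("status") == "paused"
    ]


def resume_requests(followers: list[str]) -> list[tuple[str, str]]:
    """Build (method, path) pairs for the resume follow API; nothing is sent."""
    return [("POST", f"/{name}/_ccr/resume_follow") for name in followers]


if __name__ == "__main__":
    # Hypothetical follow info response; real output has more fields.
    sample = {
        "follower_indices": [
            {"follower_index": "abc_au_error_2022.10.01", "status": "active"},
            {"follower_index": "abc_au_error_2022.09.30", "status": "paused"},
        ]
    }
    for method, path in resume_requests(paused_followers(sample)):
        print(method, path)
```

Resuming followers one at a time (rather than all at once) is a design choice worth considering when the follower cluster is already under memory pressure.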

After running GET /_ccr/stats today, I see:
"auto_follow_stats" : {
    "number_of_failed_follow_indices" : 21753791,
    "number_of_failed_remote_cluster_state_requests" : 16,
    "number_of_successful_follow_indices" : 0,
}
The exception I see for auto-follow is: "reason" : "index to follow [abc_au_error_2022.10.01] for pattern [abc*] matches with other patterns [abc]".
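The auto-follow error appears to say that the same leader index is matched by more than one auto-follow pattern, and such an index is not followed automatically; a failure counter in the tens of millions with zero successes suggests the same indices failing on every poll rather than that many distinct indices. The sketch below checks which patterns claim a given index name. The pattern names and definitions are hypothetical, and it assumes leader_index_patterns use only the '*' wildcard.

```python
from __future__ import annotations

import re


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a '*'-only wildcard into an anchored-by-fullmatch regex."""
    return re.compile(re.escape(pattern).replace(r"\*", ".*"))


def matching_auto_follow_patterns(
    index_name: str, patterns: dict[str, list[str]]
) -> list[str]:
    """Return the names of all auto-follow patterns whose leader patterns match."""
    hits = []
    for name, leader_patterns in patterns.items():
        if any(wildcard_to_regex(p).fullmatch(index_name) for p in leader_patterns):
            hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical setup: one broad pattern and one narrow pattern overlap.
    auto_follow = {
        "abc-all": ["abc*"],
        "abc-errors": ["abc_au_error_*"],
    }
    print(matching_auto_follow_patterns("abc_au_error_2022.10.01", auto_follow))
```

Any index that yields more than one name is a candidate for this error; making the patterns disjoint (or dropping the redundant pattern) would be the usual way out.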

Individual follower indices also show various fatal exceptions:
"fatal_exception" : {
              "type" : "index_not_found_exception",
              "reason" : "no such index [abc_au_error_2022.02.20]"
}
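An index_not_found_exception on an old daily index such as abc_au_error_2022.02.20 likely means the leader index was deleted (for example by a retention policy on the leader), so the follower can never resume. A minimal sketch for spotting such followers, assuming follow info entries with follower_index and leader_index fields and a set of leader index names gathered separately (the sample data is hypothetical):

```python
from __future__ import annotations

from typing import Any


def orphaned_followers(ccr_info: dict[str, Any], leader_indices: set[str]) -> list[str]:
    """Return followers whose leader_index is absent from leader_indices."""
    return sorted(
        entry["follower_index"]
        for entry in ccr_info.get("follower_indices", [])
        if entry["leader_index"] not in leader_indices
    )


if __name__ == "__main__":
    # Hypothetical data: the February leader index no longer exists.
    info = {
        "follower_indices": [
            {"follower_index": "abc_au_error_2022.02.20",
             "leader_index": "abc_au_error_2022.02.20", "status": "paused"},
            {"follower_index": "abc_au_error_2022.10.01",
             "leader_index": "abc_au_error_2022.10.01", "status": "active"},
        ]
    }
    print(orphaned_followers(info, {"abc_au_error_2022.10.01"}))
```

Such followers can be paused, closed and unfollowed to keep them as regular indices, or deleted if the data is not needed on the DR side.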
"fatal_exception" : {
              "type" : "circuit_breaking_exception",
              "reason" : "[parent] Data too large, data for [<transport_request>] would be [24467651698/22.7gb], which is larger than the limit of [24398446592/22.7gb], real usage: [24467650928/22.7gb], new bytes reserved: [770/770b], usages [request=0/0b, fielddata=30948187/29.5mb, in_flight_requests=158504/154.7kb, accounting=620829855/592mb]",
              "bytes_wanted" : 24467651698,
              "bytes_limit" : 24398446592,
              "durability" : "PERMANENT"
}
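The circuit breaker numbers can be checked directly. The values below are copied from the error; the arithmetic shows that bytes_wanted is just real usage plus the 770-byte request, and that real usage alone was already above the parent limit. That points at overall heap pressure on the node rather than at this particular request (hedged: "real usage" is the measured heap, which may include garbage not yet collected).

```python
# Worked check of the circuit_breaking_exception above; no cluster involved.
GIB = 1024 ** 3
MIB = 1024 ** 2

real_usage = 24_467_650_928   # "real usage" reported by the parent breaker
new_bytes = 770               # "new bytes reserved" by this transport request
bytes_limit = 24_398_446_592  # "bytes_limit"

bytes_wanted = real_usage + new_bytes
overshoot = bytes_wanted - bytes_limit

print(f"bytes_wanted = {bytes_wanted}")
print(f"limit        = {bytes_limit / GIB:.1f} GiB")
print(f"overshoot    = {overshoot / MIB:.1f} MiB")
print(f"already over before request: {real_usage > bytes_limit}")
```

The tracked child breakers (fielddata about 29.5 MB, accounting about 592 MB) account for only a small part of the roughly 22.7 GiB in use.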

"fatal_exception" : {
              "type" : "process_cluster_event_timeout_exception",
              "reason" : "failed to process cluster event (put-mapping) within 30s"
}

"fatal_exception" : {
              "type" : "resource_not_found_exception",
              "reason" : "Operations are no longer available for replicating. Existing retention leases [[RetentionLease{id='peer_recovery/lc-tMfA4QFyrm2M1XgbMVg', retainingSequenceNumber=6801469, timestamp=1662746628089, source='peer recovery'}, RetentionLease{id='peer_recovery/wgyOlKHmS--NyIWW9Le1ow', retainingSequenceNumber=6801469, timestamp=1662746628089, source='peer recovery'}]]; maybe increase the retention lease period setting [index.soft_deletes.retention_lease.period]?",
              "requested_operations_missing" : [
                "6797017",
                "6801468"
              ],
              "caused_by" : {
                "type" : "illegal_state_exception",
                "reason" : "Not all operations between from_seqno [6797017] and to_seqno [6801468] found; expected seqno [6797017]; found [Index{id='13500', type='_doc', seqNo=6801369, primaryTerm=7, version=23666, autoGeneratedIdTimestamp=-1}]"
              }
}
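The last error is the one that explains why some followers cannot simply be resumed: the follower asks for operations starting at from_seqno 6797017, but the leader only guarantees history at or above the retaining sequence number of its retention leases (6801469), and both listed leases have source 'peer recovery', so no lease from the follower seems to remain. The sketch below does that arithmetic and lists the usual re-bootstrap sequence (pause, close, unfollow, delete, follow again). The rebootstrap_steps helper and its remote cluster and index arguments are hypothetical; raising index.soft_deletes.retention_lease.period (12h by default) on the leader, as the error itself suggests, would let followers survive longer pauses.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RetentionLease:
    lease_id: str
    retaining_seq_no: int
    timestamp_ms: int
    source: str


def missing_operations(from_seq_no: int, leases: list[RetentionLease]) -> int:
    """Count operations the follower needs that the leader may no longer retain."""
    min_retained = min(lease.retaining_seq_no for lease in leases)
    return max(0, min_retained - from_seq_no)


def rebootstrap_steps(follower: str, remote_cluster: str, leader: str) -> list[tuple[str, str, dict]]:
    """Hypothetical helper: the API calls to rebuild a follower from scratch."""
    return [
        ("POST", f"/{follower}/_ccr/pause_follow", {}),
        ("POST", f"/{follower}/_close", {}),
        ("POST", f"/{follower}/_ccr/unfollow", {}),
        ("DELETE", f"/{follower}", {}),
        ("PUT", f"/{follower}/_ccr/follow",
         {"remote_cluster": remote_cluster, "leader_index": leader}),
    ]


# Leases copied from the resource_not_found_exception above.
leases = [
    RetentionLease("peer_recovery/lc-tMfA4QFyrm2M1XgbMVg", 6801469, 1662746628089, "peer recovery"),
    RetentionLease("peer_recovery/wgyOlKHmS--NyIWW9Le1ow", 6801469, 1662746628089, "peer recovery"),
]

gap = missing_operations(6797017, leases)
print(f"operations no longer guaranteed: {gap}")
print("lease not from peer recovery present:", any(l.source != "peer recovery" for l in leases))
```

A gap greater than zero means resume_follow will keep failing for that index, and only rebuilding the follower gets it back in sync.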

Since I am new to this DR-related CCR work, any pointers on how to tackle this would be helpful.
