CCR exception/warning issue. RetentionLeaseNotFoundException

Armen_Petrosyan · December 8, 2020, 11:07am

Scenario

Create leader cluster
Create an index with index.soft_deletes.retention_lease.period = 5s (***). I chose this period for test purposes only.
Index a document and wait 5s before the next step, so that the retention lease expires
Create follower cluster
Follow leader’s index
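For readers who want to reproduce the scenario, the index-creation and follow steps can be sketched as REST requests. This is a minimal sketch, not the poster's actual commands: the two host URLs are hypothetical, the index names and the remote cluster alias `leader` are read off the log line further down, and the requests are only built here, not sent.

```python
import json
from urllib.request import Request

LEADER_URL = "http://leader.example:9200"      # hypothetical address
FOLLOWER_URL = "http://follower.example:9200"  # hypothetical address


def create_leader_index(index: str, lease_period: str) -> Request:
    """Build PUT /<index> with a custom retention lease period."""
    body = {"settings": {"index.soft_deletes.retention_lease.period": lease_period}}
    return Request(
        f"{LEADER_URL}/{index}",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def follow_index(follower_index: str, remote_cluster: str, leader_index: str) -> Request:
    """Build PUT /<follower_index>/_ccr/follow against the follower cluster."""
    body = {"remote_cluster": remote_cluster, "leader_index": leader_index}
    return Request(
        f"{FOLLOWER_URL}/{follower_index}/_ccr/follow",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


steps = [
    create_leader_index("project_green", "5s"),
    follow_index("project_green_replicated", "leader", "project_green"),
]
for req in steps:
    # Print instead of sending, so the sketch runs without a cluster.
    print(req.get_method(), req.full_url, req.data.decode())
```

To actually send a request, pass it to `urllib.request.urlopen`; authentication and the remote cluster configuration are left out of this sketch.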

Expected:

Data is replicated and seen in follower cluster
Restore status API shows that the index was fully copied, because the retention lease expired

Actual:

Data is replicated and seen in follower cluster
I don't see that the index was entirely copied
Continuous warnings in the follower's log, in which I see the exception below:

{"type": "server", "timestamp": "2020-12-08T10:28:59,459Z", "level": "WARN", "component": "o.e.x.c.a.ShardFollowTasksExecutor", "cluster.name": "elasticsearch2", "node.name": "elasticsearch2-master-0", "message": "[project_green_replicated][1] background management of retention lease [elasticsearch2/project_green_replicated/bUhiTHeyTSOEDjUDKkDEdA-following-leader/project_green/vMKgmrHhQui7j7hP2UZNww] failed while following ", "cluster.uuid": "8phR09OoRJ2mRm-Ctu3O2Q", "node.id": "ppg4KsC7S-Gm58GzWXu1XQ" ,

"stacktrace": ["org.elasticsearch.index.seqno.RetentionLeaseNotFoundException: retention lease with ID [elasticsearch2/project_green_replicated/bUhiTHeyTSOEDjUDKkDEdA-following-leader/project_green/vMKgmrHhQui7j7hP2UZNww] not found ",

"at org.elasticsearch.index.seqno.ReplicationTracker.renewRetentionLease(ReplicationTracker.java:397) ~[elasticsearch-7.10.0.jar:7.10.0]",

"at org.elasticsearch.index.shard.IndexShard.renewRetentionLease(IndexShard.java:2218) ~[elasticsearch-7.10.0.jar:7.10.0]",

"at org.elasticsearch.index.seqno.RetentionLeaseActions$Renew$TransportAction.doRetentionLeaseAction(RetentionLeaseActions.java:206) ~[elasticsearch-7.10.0.jar:7.10.0]",
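The retention lease ID in this log line appears to encode both sides of the follow relationship: follower cluster, follower index and index UUID, then `-following-`, then the remote cluster alias, leader index and index UUID. That reading is inferred from the log line itself, not from documentation, so the pattern below is an assumption. A short sketch that splits the ID into its parts, which can help when matching warnings to a specific follower index:

```python
import re

# Assumed layout, inferred from the log line above:
# <follower cluster>/<follower index>/<follower UUID>-following-<leader alias>/<leader index>/<leader UUID>
LEASE_ID_PATTERN = re.compile(
    r"^(?P<follower_cluster>[^/]+)/(?P<follower_index>[^/]+)/(?P<follower_uuid>[^/]+?)"
    r"-following-"
    r"(?P<leader_alias>[^/]+)/(?P<leader_index>[^/]+)/(?P<leader_uuid>[^/]+)$"
)


def parse_lease_id(lease_id: str) -> dict:
    """Split a CCR retention lease ID into named parts, or raise ValueError."""
    match = LEASE_ID_PATTERN.match(lease_id)
    if match is None:
        raise ValueError(f"unrecognised retention lease ID: {lease_id!r}")
    return match.groupdict()


lease_id = (
    "elasticsearch2/project_green_replicated/bUhiTHeyTSOEDjUDKkDEdA"
    "-following-leader/project_green/vMKgmrHhQui7j7hP2UZNww"
)
for key, value in parse_lease_id(lease_id).items():
    print(f"{key}: {value}")
```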

My expectations might be wrong, especially the second one. But this is what I understand from https://www.elastic.co/guide/en/elasticsearch/reference/7.x/xpack-ccr.html#ccr-leader-requirements

Quote from doc:

(***) The index.soft_deletes.retention_lease.period setting defines the maximum time to retain a shard history retention lease before it is considered expired. This setting determines how long the cluster containing your leader index can be offline, which is 12 hours by default. If a shard copy recovers after its retention lease expires, then Elasticsearch will fall back to copying the entire index, because it can no longer replay the missing history.

DavidTurner · December 8, 2020, 3:23pm

This is the expected behaviour with such a short retention period. Retention leases don't expire in normal operation: expiry is treated as an indication of failure.

Armen_Petrosyan · December 9, 2020, 6:08am

So then

OK, suppose it happened. Why does this exception keep coming and making my logs bigger and bigger?
What if I created a leader and a follower, indexed documents, and the leader failed in the middle of replication, then recovered after the lease period expired? Will I see such a warning in the logs? Will I see in the restore status that the entire index was copied?
I'm not sure what "such a short retention period" means. Could you please elaborate on short versus long periods?

DavidTurner · December 9, 2020, 7:32am

The exceptions are continuous because the failures are continuous, because you have set the retention lease expiry period so short. Set it back to the default and you will be fine.

Yes, you'll see warnings if a follower returns after ≥12 hours of absence. That makes sense to me: a 12-hour outage would be pretty severe.

The default is 12 hours. You have selected 5 seconds. 5 seconds is a lot shorter than 12 hours.
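The first answer above can be illustrated with a toy model. It is not Elasticsearch's actual implementation: it assumes the follower tries to renew its lease on a fixed interval (the 30 seconds used here is an assumed value, not taken from the thread or the docs) and that a lease is gone once more than the retention period has passed since it was last touched.

```python
def count_failed_renewals(lease_period_s: int, renew_interval_s: int, duration_s: int) -> int:
    """Count renewal attempts that find the lease already expired.

    Toy model: the lease is created at t=0 and touched at every renewal
    attempt. An attempt at time t fails if more than lease_period_s has
    passed since the lease was last touched.
    """
    last_touched = 0
    failures = 0
    t = renew_interval_s
    while t <= duration_s:
        if t - last_touched > lease_period_s:
            failures += 1  # lease expired in the meantime: "not found"
        # Assumption: after each attempt the lease exists again,
        # either renewed or re-created.
        last_touched = t
        t += renew_interval_s
    return failures


five_minutes = 300
print("5s period: ", count_failed_renewals(5, 30, five_minutes), "failed renewals")
print("12h period:", count_failed_renewals(12 * 3600, 30, five_minutes), "failed renewals")
```

In this model every renewal fails whenever the retention period is shorter than the renewal interval, which matches the explanation that the failures are continuous because the period is so short.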

Armen_Petrosyan · December 9, 2020, 8:19am

But I see that missing documents that were indexed >12h ago are replicated, and these exceptions still keep coming.

Doc says:

The index.soft_deletes.retention_lease.period setting defines the maximum time to retain a shard history retention lease before it is considered expired. This setting determines how long the cluster containing your leader index can be offline, which is 12 hours by default. If a shard copy recovers after its retention lease expires, then Elasticsearch will fall back to copying the entire index, because it can no longer replay the missing history.

So I wanted to check the scenario where the leader comes back up after 12h, in which case "Elasticsearch will fall back to copying the entire index, because it can no longer replay the missing history." I also wanted to see in the recovery stats that this really happens...
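One way to look for a full copy is to summarise what the index recovery API (`GET /<index>/_recovery`) reports per shard. The sketch below works on a hand-written, hypothetical sample in the shape of that response, not on output from the clusters in this thread. Treating "recovered files equal total files" as a sign of a full file copy is this sketch's own heuristic, not an official indicator.

```python
from typing import Dict, List

# Hypothetical sample shaped like a GET /<index>/_recovery response.
SAMPLE_RECOVERY = {
    "project_green_replicated": {
        "shards": [
            {"id": 0, "type": "SNAPSHOT", "stage": "DONE", "primary": True,
             "index": {"files": {"total": 12, "reused": 0, "recovered": 12}}},
            {"id": 1, "type": "PEER", "stage": "DONE", "primary": False,
             "index": {"files": {"total": 12, "reused": 9, "recovered": 3}}},
        ]
    }
}


def summarise_recovery(response: Dict) -> List[str]:
    """Return one summary line per shard recovery in the response."""
    lines = []
    for index_name, index_info in response.items():
        for shard in index_info["shards"]:
            files = shard["index"]["files"]
            # Heuristic: every file was copied, none reused from local disk.
            full_copy = files["total"] > 0 and files["recovered"] == files["total"]
            lines.append(
                f"{index_name}[{shard['id']}] type={shard['type']} "
                f"stage={shard['stage']} "
                f"files={files['recovered']}/{files['total']} full_copy={full_copy}"
            )
    return lines


for line in summarise_recovery(SAMPLE_RECOVERY):
    print(line)
```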

Armen_Petrosyan · December 9, 2020, 3:00pm

I did a few more tests.

First test

Create leader index (retention 1h)
Index document
Wait for 1h
Create follower cluster and follow

Result
Data is replicated
Nothing in the recovery stats shows that the index was entirely copied
Exceptions every 20-30 sec

Second test

Create leader index (retention 1h)
Index document
Create follower cluster and follow
Pause follower
Index document to leader
Wait for 1h
Resume follower

Result
Data is replicated, but again I don't see in the recovery stats that the whole index was copied from scratch. The exception is thrown only once.
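The pause and resume steps of the second test correspond to the CCR `pause_follow` and `resume_follow` endpoints on the follower cluster. A minimal sketch of those two calls, assuming a hypothetical follower address and the follower index name from the log earlier in the thread. The requests are built but not sent.

```python
from urllib.request import Request

FOLLOWER_URL = "http://follower.example:9200"  # hypothetical address


def pause_follow(follower_index: str) -> Request:
    """Build POST /<follower_index>/_ccr/pause_follow."""
    return Request(f"{FOLLOWER_URL}/{follower_index}/_ccr/pause_follow", method="POST")


def resume_follow(follower_index: str) -> Request:
    """Build POST /<follower_index>/_ccr/resume_follow."""
    return Request(f"{FOLLOWER_URL}/{follower_index}/_ccr/resume_follow", method="POST")


for req in (pause_follow("project_green_replicated"),
            resume_follow("project_green_replicated")):
    # Print instead of sending, so the sketch runs without a cluster.
    print(req.get_method(), req.full_url)
```

Between the two calls, the test indexes a document on the leader and waits for the retention period to pass.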

Totally confused.

