CCR exception/warning issue. RetentionLeaseNotFoundException

Scenario

  1. Create leader cluster
  2. Create index with index.soft_deletes.retention_lease.period = 5s (***). I chose this period only for testing.
  3. Index a document and wait 5s before the next step so that the retention lease expires
  4. Create follower cluster
  5. Follow leader’s index
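
For anyone reproducing this, the steps above correspond roughly to the REST calls sketched below. The sketch only builds and prints the requests, it does not send them. The index names and the remote-cluster alias `leader` are taken from the log further down, the helper name `build_repro_requests` and the test document are made up for illustration, and it assumes the leader has already been registered as a remote cluster on the follower.

```python
import json


def build_repro_requests(leader_index, follower_index, remote_alias, lease_period):
    """Return (cluster, method, path, body) tuples for the repro steps."""
    # Step 2: create the leader index with a very short lease period.
    create_leader = (
        "leader", "PUT", f"/{leader_index}",
        {"settings": {"index.soft_deletes.retention_lease.period": lease_period}},
    )
    # Step 3: index one document (hypothetical content).
    index_doc = ("leader", "POST", f"/{leader_index}/_doc", {"message": "test"})
    # Step 5: create the follower index via the CCR follow API.
    follow = (
        "follower", "PUT", f"/{follower_index}/_ccr/follow",
        {"remote_cluster": remote_alias, "leader_index": leader_index},
    )
    return [create_leader, index_doc, follow]


requests_to_send = build_repro_requests(
    "project_green", "project_green_replicated", "leader", "5s"
)
for cluster, method, path, body in requests_to_send:
    print(f"[{cluster}] {method} {path} {json.dumps(body)}")
```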

Expected:

  1. Data is replicated and seen in follower cluster
  2. The restore status API shows that the index was fully copied, because the retention lease expired

Actual:

  1. Data is replicated and seen in follower cluster

  2. I don't see that the index was entirely copied

  3. Continuous warnings in the follower's log, containing the exception below:

    {"type": "server", "timestamp": "2020-12-08T10:28:59,459Z", "level": "WARN", "component": "o.e.x.c.a.ShardFollowTasksExecutor", "cluster.name": "elasticsearch2", "node.name": "elasticsearch2-master-0", "message": "[project_green_replicated][1] background management of retention lease [elasticsearch2/project_green_replicated/bUhiTHeyTSOEDjUDKkDEdA-following-leader/project_green/vMKgmrHhQui7j7hP2UZNww] failed while following ", "cluster.uuid": "8phR09OoRJ2mRm-Ctu3O2Q", "node.id": "ppg4KsC7S-Gm58GzWXu1XQ" ,

    "stacktrace": ["org.elasticsearch.index.seqno.RetentionLeaseNotFoundException: retention lease with ID [elasticsearch2/project_green_replicated/bUhiTHeyTSOEDjUDKkDEdA-following-leader/project_green/vMKgmrHhQui7j7hP2UZNww] not found ",

    "at org.elasticsearch.index.seqno.ReplicationTracker.renewRetentionLease(ReplicationTracker.java:397) ~[elasticsearch-7.10.0.jar:7.10.0]",

    "at org.elasticsearch.index.shard.IndexShard.renewRetentionLease(IndexShard.java:2218) ~[elasticsearch-7.10.0.jar:7.10.0]",

    "at org.elasticsearch.index.seqno.RetentionLeaseActions$Renew$TransportAction.doRetentionLeaseAction(RetentionLeaseActions.java:206) ~[elasticsearch-7.10.0.jar:7.10.0]",

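The lease ID in the warning appears to encode both sides of the follow relationship. Below is a small sketch that splits it, assuming the layout `<follower cluster>/<follower index>/<follower index UUID>-following-<remote alias>/<leader index>/<leader index UUID>`. That layout is read off this one log line, and the helper name `parse_lease_id` is made up for illustration.

```python
import re

# Assumed layout, inferred from the log line above.
LEASE_ID_PATTERN = re.compile(
    r"^(?P<follower_cluster>[^/]+)/(?P<follower_index>[^/]+)/(?P<follower_uuid>[^/]+?)"
    r"-following-(?P<remote_alias>[^/]+)/(?P<leader_index>[^/]+)/(?P<leader_uuid>[^/]+)$"
)


def parse_lease_id(lease_id):
    """Split a CCR retention lease ID into named parts, or return None."""
    match = LEASE_ID_PATTERN.match(lease_id)
    return match.groupdict() if match else None


lease = parse_lease_id(
    "elasticsearch2/project_green_replicated/bUhiTHeyTSOEDjUDKkDEdA"
    "-following-leader/project_green/vMKgmrHhQui7j7hP2UZNww"
)
print(lease)
```

Read this way, the lease lives on the leader index `project_green` and is the one the follower `project_green_replicated` keeps trying to renew.
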
My expectations might be wrong, especially point 2. But this is what I understand from https://www.elastic.co/guide/en/elasticsearch/reference/7.x/xpack-ccr.html#ccr-leader-requirements

Quote from doc:

(***) The index.soft_deletes.retention_lease.period setting defines the maximum time to retain a shard history retention lease before it is considered expired. This setting determines how long the cluster containing your leader index can be offline, which is 12 hours by default. If a shard copy recovers after its retention lease expires, then Elasticsearch will fall back to copying the entire index, because it can no longer replay the missing history.

This is the expected behaviour with such a short retention period. Retention leases don't expire in normal operation: expiry is treated as an indication of failure.

So then

  1. OK, suppose it happened. Why does this exception keep coming, making my logs bigger and bigger?
  2. What if I created a leader and a follower, indexed documents, and the leader failed in the middle of replication, then recovered after the lease period had expired? Will I see such a warning in the logs? Will the restore status show that the entire index was copied?
  3. I'm not sure what "such a short retention period" means. Could you please elaborate on what counts as short or long?

The exceptions are continuous because the failures are continuous, because you have set the retention lease expiry period so short. Set it back to the default and you will be fine.

Yes, you'll see warnings if a follower returns after ≥12 hours of absence. That makes sense to me: a 12-hour outage would be pretty severe.

The default is 12 hours. You have selected 5 seconds. 5 seconds is a lot shorter than 12 hours.
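
To make the arithmetic concrete, here is a toy model (not Elasticsearch code) of why a 5-second period produces a failure on every renewal attempt while 12 hours produces none. It assumes the follower attempts a renewal at a fixed interval of 30 seconds, which is only a guess based on the exception cadence reported in this thread, and that a failed renewal re-creates the lease. The function name `count_failed_renewals` is made up for illustration.

```python
def count_failed_renewals(period_s, renew_interval_s, duration_s):
    """Count renewal attempts that find the lease already expired.

    Toy model: a lease expires once it has gone unrenewed for longer than
    period_s; every attempt (successful or not) leaves a fresh lease behind.
    """
    last_renewed = 0
    failures = 0
    t = renew_interval_s
    while t <= duration_s:
        if t - last_renewed > period_s:
            failures += 1  # lease expired before this renewal arrived
        last_renewed = t
        t += renew_interval_s
    return failures


five_minutes = 300
failures_short = count_failed_renewals(5, 30, five_minutes)
failures_default = count_failed_renewals(12 * 60 * 60, 30, five_minutes)
print(failures_short, failures_default)
```

In this model, any lease period shorter than the renewal interval fails on every attempt, which matches a warning that repeats for as long as the follower is running.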

But I see that missing documents indexed more than 12h ago are replicated, and these exceptions still keep coming.

Doc says:

The index.soft_deletes.retention_lease.period setting defines the maximum time to retain a shard history retention lease before it is considered expired. This setting determines how long the cluster containing your leader index can be offline, which is 12 hours by default. If a shard copy recovers after its retention lease expires, then Elasticsearch will fall back to copying the entire index, because it can no longer replay the missing history.

So I wanted to check the scenario where the leader comes back after 12h, in which case "Elasticsearch will fall back to copying the entire index, because it can no longer replay the missing history." I wanted to see in the recovery stats that this really happens...
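
One way to look for a full copy is to summarize the output of `GET /<index>/_recovery` by recovery type per shard. The sketch below does that over a response-shaped dictionary. The sample is hypothetical and heavily trimmed, not output from a real cluster, and whether a full CCR copy shows up as type `SNAPSHOT` is an assumption to verify against your own cluster. The helper name `summarize_recovery` is made up for illustration.

```python
from collections import Counter


def summarize_recovery(recovery_response):
    """Count shard recoveries per (index, recovery type)."""
    summary = Counter()
    for index_name, index_info in recovery_response.items():
        for shard in index_info.get("shards", []):
            summary[(index_name, shard.get("type", "UNKNOWN"))] += 1
    return dict(summary)


# Hypothetical, trimmed response; real responses carry many more fields.
sample = {
    "project_green_replicated": {
        "shards": [
            {"id": 0, "type": "SNAPSHOT", "stage": "DONE", "primary": True},
            {"id": 1, "type": "SNAPSHOT", "stage": "DONE", "primary": True},
        ]
    }
}
recovery_summary = summarize_recovery(sample)
print(recovery_summary)
```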

I ran a few more tests.

First test

  1. Create leader index (retention 1h)
  2. Index document
  3. Wait for 1h
  4. Create follower cluster and follow

Result
Data is replicated
Nothing in the recovery stats indicates that the index was copied in full
Exceptions every 20-30 seconds

Second test

  1. Create leader index (retention 1h)
  2. Index document
  3. Create follower cluster and follow
  4. Pause follower
  5. Index document to leader
  6. Wait for 1h
  7. Resume follower
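
The pause and resume steps above map to the CCR pause-follow and resume-follow endpoints on the follower cluster. A minimal sketch that only builds the request paths, with the follower index name taken from the log earlier in the thread and the helper name `follower_control_path` made up for illustration:

```python
def follower_control_path(follower_index, action):
    """Build the path for a POST that pauses or resumes a follower index."""
    if action not in ("pause_follow", "resume_follow"):
        raise ValueError(f"unsupported action: {action}")
    return f"/{follower_index}/_ccr/{action}"


pause_path = follower_control_path("project_green_replicated", "pause_follow")
resume_path = follower_control_path("project_green_replicated", "resume_follow")
print("POST", pause_path)
print("POST", resume_path)
```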

Result
Data is replicated, but again I don't see in the recovery stats that the whole index was copied from scratch. The exception was thrown only once.

Totally confused.
