Removing second S3 repository causes "Connection Pool Shutdown" in 8.13.2

We upgraded to 8.13.2 after having issues on 8.10 with Elasticsearch does not refresh AWS Web Identity Token file when changed on disk · Issue #101828 · elastic/elasticsearch · GitHub.

We just had a cluster running 8.13.2 lose access to the S3 repository. Restarting all of the master pods in K8s allowed us to view the snapshots again and run a verification, which showed that none of the data nodes could talk to the S3 repo. The nodes had been up for 46 days before snapshots started failing.

We run around 20 ES clusters, and so far this is the only one we've seen affected.

The Verify response after restarting all masters is:

{
  "name": "ResponseError",
  "message": "repository_verification_exception\n\tRoot causes:\n\t\trepository_verification_exception: [s3backup] [[3OY76lDlQEeXLa7ewplPgA, 'org.elasticsearch.transport.RemoteTransportException: [es-data-33][10.10.10.10:9300][internal:admin/repository/verify]'], [Same for the every data node ...

We discovered that having the masters able to talk to the S3 repo while the data nodes could not caused high load on the masters for some reason: from a typical 8% CPU they ran at 75-100% CPU constantly.
We restarted all data nodes and were still seeing high master CPU.
The final resolution was to delete the S3 repo and recreate it; suddenly everything dropped back to expected levels.
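
For anyone trying the same thing, the delete/recreate was roughly the following; the bucket and base_path values here are placeholders, and any other settings should match whatever your original repository used:

DELETE _snapshot/s3backup

PUT _snapshot/s3backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "base_path": "elasticsearch/snapshots"
  }
}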
It could be something to do with:
Elasticsearch Exporter collecting snapshot statistics (it was failing and OOM'ing until the repo was deleted)
Bad state when the masters can see the S3 repo, but the data nodes can't

Halfway through a new snapshot, we again started receiving errors. After a chat with AWS we discovered the EKS pod had begun using the Instance Role instead of the Pod Role.

This appears to have something to do with the AWS token refresh, perhaps happening during a snapshot, which causes the S3 client's connection pool to shut down. It's not closed by ES, so all future calls fail:

com.amazonaws.AmazonClientException: java.lang.IllegalStateException: Connection pool shut down
    at com.amazonaws.auth.RefreshableTask.refreshValue(RefreshableTask.java:303)
    at com.amazonaws.auth.RefreshableTask.blockingRefresh(RefreshableTask.java:251)
    at com.amazonaws.auth.RefreshableTask.getValue(RefreshableTask.java:192)
    at com.amazonaws.auth.STSAssumeRoleWithWebIdentitySessionCredentialsProvider.getCredentials(STSAssumeRoleWithWebIdentitySessionCredentialsProvider.java:130)

We doubled the memory of the masters, but this can still happen.

It appears this is because this cluster had a second S3 repository added and then removed. We haven't found a way to recover from this yet, but we see that 8.15.0 has a release note suggesting this was potentially fixed.

Removing the second S3 repository, then restarting all masters, then restarting all data nodes appears to have solved the problem. :crossed_fingers:
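
The cleanup itself was just the repository APIs (the second repository name here is a placeholder), followed by the restarts:

GET _snapshot/_all

DELETE _snapshot/second-s3-repo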

Seeing others reporting the same issue: Removing one of the s3 snapshot repository causing connection pool shutdown
Repository_verification_exception

It appears that the "snapshot retention" job is the one that causes the failure, not the snapshot itself.
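
If it helps anyone reproduce this, the retention run can be checked and triggered manually with the standard SLM endpoints rather than waiting for the scheduled job:

GET _slm/stats

POST _slm/_execute_retention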

Resolved (for a few days now, at least) by removing all but one S3 repo, restarting all nodes in the cluster, then immediately removing old snapshots up to our current retention.
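
The manual snapshot cleanup was nothing more than listing and deleting; the snapshot name below is a placeholder:

GET _cat/snapshots/s3backup?v&s=end_epoch

DELETE _snapshot/s3backup/daily-snap-2024.05.01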
