Removing second S3 repository causes "Connection Pool Shutdown" in 8.13.2

We upgraded to 8.13.2 after having issues on 8.10 with Elasticsearch does not refresh AWS Web Identity Token file when changed on disk · Issue #101828 · elastic/elasticsearch · GitHub.

We just had a cluster running 8.13.2 lose access to the S3 repository. Restarting all of the master pods in K8s allowed us to view the snapshots again and run a verification, which showed that none of the data nodes could talk to the S3 repo. The nodes had been up for 46 days before snapshots started failing.

We run around 20 ES clusters, and so far this is the only one we've seen affected.

The Verify response after restarting all masters is:

{
  "name": "ResponseError",
  "message": "repository_verification_exception\n\tRoot causes:\n\t\trepository_verification_exception: [s3backup] [[3OY76lDlQEeXLa7ewplPgA, 'org.elasticsearch.transport.RemoteTransportException: [es-data-33][10.10.10.10:9300][internal:admin/repository/verify]'], [Same for the every data node ...

We discovered that having the masters able to talk to the S3 repo while the data nodes could not caused high load on the masters for some reason: from a typical 8% CPU they ran at 75-100% CPU constantly.
We restarted all data nodes and were still seeing high master CPU.
The final resolution was to delete the S3 repo and recreate it; suddenly everything dropped back to expected levels.
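
For anyone trying the same thing, the delete/recreate was roughly the following; the bucket and base_path values here are placeholders, and any other settings should match whatever your original repository used:

DELETE _snapshot/s3backup

PUT _snapshot/s3backup
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "base_path": "elasticsearch/snapshots"
  }
}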
It could be something to do with:
Elasticsearch Exporter collecting snapshot statistics (it was failing and OOM'ing until the repo was deleted)
Bad state when the masters can see the S3 repo, but the data nodes can't

Halfway through a new snapshot, we again started receiving errors. After a chat with AWS we discovered the EKS pod had begun using the Instance Role instead of the Pod Role.

This appears to have something to do with the AWS token refresh, perhaps happening during a snapshot, which causes the S3 client's connection pool to shut down. It's not closed by ES, so all future calls fail:

com.amazonaws.AmazonClientException: java.lang.IllegalStateException: Connection pool shut down
    at com.amazonaws.auth.RefreshableTask.refreshValue(RefreshableTask.java:303)
    at com.amazonaws.auth.RefreshableTask.blockingRefresh(RefreshableTask.java:251)
    at com.amazonaws.auth.RefreshableTask.getValue(RefreshableTask.java:192)
    at com.amazonaws.auth.STSAssumeRoleWithWebIdentitySessionCredentialsProvider.getCredentials(STSAssumeRoleWithWebIdentitySessionCredentialsProvider.java:130)

We doubled the memory of the masters, but this can still happen.

It appears this is because this cluster had a second S3 repository added and then removed. We haven't found a way to recover from this yet, but we see that 8.15.0 has a release note suggesting this was potentially fixed.

Removing the second S3 repository, then restarting all masters, then restarting all data nodes appears to have solved the problem. :crossed_fingers:
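
The cleanup itself was just the repository APIs (the second repository name here is a placeholder), followed by the restarts:

GET _snapshot/_all

DELETE _snapshot/second-s3-repo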

Seeing others reporting the same issue: Removing one of the s3 snapshot repository causing connection pool shutdown
Repository_verification_exception

It appears that the "snapshot retention" job is the one that causes the failure, not the snapshot itself.
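
If it helps anyone reproduce this, the retention run can be checked and triggered manually with the standard SLM endpoints rather than waiting for the scheduled job:

GET _slm/stats

POST _slm/_execute_retention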

Resolved (for a few days now, at least) by removing all but one S3 repo, restarting all nodes in the cluster, then immediately removing old snapshots up to our current retention.
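
The manual snapshot cleanup was nothing more than listing and deleting; the snapshot name below is a placeholder:

GET _cat/snapshots/s3backup?v&s=end_epoch

DELETE _snapshot/s3backup/daily-snap-2024.05.01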
