We just had a cluster running 8.13.2 lose access to its S3 repository. Restarting all of the Master pods in K8s allowed us to view the snapshots again and run a repository verification, which showed that none of the Data nodes can talk to the S3 repo. The nodes had been up for 46 days before snapshots started failing.
We run around 20 ES clusters and so far this is the only one we've seen affected.
The Verify response after restarting all masters is:
{
"name": "ResponseError",
"message": "repository_verification_exception\n\tRoot causes:\n\t\trepository_verification_exception: [s3backup] [[3OY76lDlQEeXLa7ewplPgA, 'org.elasticsearch.transport.RemoteTransportException: [es-data-33][10.10.10.10:9300][internal:admin/repository/verify]'], [Same for the every data node ...
We discovered that having the Masters able to talk to the S3 repo while the data nodes could not caused high load on the Masters for some reason. From a typical 8% CPU, they ran at 75-100% CPU constantly.
We restarted all data nodes and were still seeing high CPU on the Masters.
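In case it helps anyone chasing the same symptom, this is roughly how we confirmed where the load was and what the masters were busy with - nothing special, just the standard cat and hot_threads APIs:

# Per-node CPU and load, to confirm it's the master-eligible nodes running hot
curl -s "localhost:9200/_cat/nodes?v&h=name,node.role,cpu,load_1m"

# Hot threads on the elected master, to see what it's actually spinning on
curl -s "localhost:9200/_nodes/_master/hot_threads"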
Final resolution was to delete the S3 repo and recreate it ... suddenly everything dropped to expected levels again.
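For anyone needing to do the same, the delete/recreate is just the standard repository APIs - the bucket name below is a placeholder, you'd use whatever settings the repo was originally registered with:

# Unregister the repository (this does NOT delete the snapshot data in the bucket)
curl -s -X DELETE "localhost:9200/_snapshot/s3backup"

# Re-register it with the same settings so the existing snapshots show up again
curl -s -X PUT "localhost:9200/_snapshot/s3backup" -H 'Content-Type: application/json' -d'
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "client": "default"
  }
}'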
Could be something to do with:
- Elasticsearch Exporter collecting snapshot statistics (it was failing and OOM'ing until the repo was deleted) - see the check after this list
- Bad state when the Masters can see the S3 repo but the data nodes can't
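If you want to check whether snapshot-stats scraping is what's hammering the masters, you can run the equivalent calls by hand and watch master CPU while they execute (which endpoints your exporter actually hits depends on the exporter and its flags, so treat this as an approximation):

# List every snapshot in the repo (can be expensive on a large or unhealthy repo)
curl -s "localhost:9200/_snapshot/s3backup/_all?pretty"

# Status of any currently running snapshots in that repo
curl -s "localhost:9200/_snapshot/s3backup/_status?pretty"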
Halfway through a new snapshot, we again started receiving errors. After a chat with AWS we discovered that the EKS pod had begun using the Instance Role instead of the Pod Role.
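For anyone wanting to check the same thing, you can confirm from inside the pod which identity the S3 client will end up with; the namespace below is a placeholder, the pod name is one of ours from the error above:

# IRSA injects these env vars for the service-account (Pod) role
kubectl -n elastic exec es-data-33 -- env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'

# If the AWS CLI is available in the image, this shows which role is actually being assumed
kubectl -n elastic exec es-data-33 -- aws sts get-caller-identity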
This appears to have something to do with the AWS token refresh, perhaps happening during a snapshot, which causes the S3 client's connection pool to shut down. The client is not closed and recreated by ES, so all future calls fail:
com.amazonaws.AmazonClientException: java.lang.IllegalStateException: Connection pool shut down
	at com.amazonaws.auth.RefreshableTask.refreshValue(RefreshableTask.java:303)
	at com.amazonaws.auth.RefreshableTask.blockingRefresh(RefreshableTask.java:251)
	at com.amazonaws.auth.RefreshableTask.getValue(RefreshableTask.java:192)
	at com.amazonaws.auth.STSAssumeRoleWithWebIdentitySessionCredentialsProvider.getCredentials(STSAssumeRoleWithWebIdentitySessionCredentialsProvider.java:130)
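For monitoring purposes, a repo in this state shows up as a failed verify plus that exception in the data node logs, roughly like this (namespace/pod names are ours):

# A repo in this state fails verification with a RemoteTransportException per data node
curl -s -X POST "localhost:9200/_snapshot/s3backup/_verify?pretty"

# Count occurrences of the dead connection pool in a data node's logs
kubectl -n elastic logs es-data-33 | grep -c "Connection pool shut down"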
We doubled the memory of the masters, but this can still happen.
It appears this is because this cluster had a second S3 repository added and then removed. We haven't found a way to recover from this yet, but we see that 8.15.0 has a release note suggesting this may have been fixed.
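You can see whether a cluster has picked up an extra repository by listing what's registered - in our case we expect exactly one S3 repo:

# List every registered snapshot repository and its settings
curl -s "localhost:9200/_snapshot?pretty"

# Or the condensed view
curl -s "localhost:9200/_cat/repositories?v"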
Resolved (for a few days now, at least) by removing all but one S3 repo, restarting all nodes in the cluster, then immediately removing old snapshots down to our current retention.
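The cleanup itself was just the usual repository and snapshot deletes; the second repo name and snapshot name below are illustrative:

# Remove the extra repository so only one S3 repo stays registered
curl -s -X DELETE "localhost:9200/_snapshot/s3backup-old"

# List snapshots oldest-first to find the ones past retention
curl -s "localhost:9200/_cat/snapshots/s3backup?v&s=end_epoch"

# Delete an individual snapshot that's past retention
curl -s -X DELETE "localhost:9200/_snapshot/s3backup/daily-snap-2024.05.01"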