Elasticsearch 8.3 intermittent snapshot issues with AWS S3 on EKS


I'm having difficulties snapshotting to S3 on AWS. After an undetermined amount of time, Elasticsearch snapshots to AWS S3 start failing with an Access Denied error. The only resolution appears to be restarting all data and master nodes.

The issue seems similar to elastic/elasticsearch issue #83826 on GitHub ("IRSA on AWS EKS unable to use web identity token with repository-s3"). However, I haven't found any consistent pattern in when the failures begin, nor do I know how to reproduce the issue other than waiting for it to occur again.

Elasticsearch Details:
Host: EKS 1.21
Image: docker.elastic.co/elasticsearch/elasticsearch:8.3.1
Snapshot Settings:

{
  "s3backup": {
    "type": "s3",
    "uuid": "fI_SEkhPQ5qCwZaEcFa_Fg",
    "settings": {
      "bucket": "redacted",
      "endpoint": "s3.us-west-2.amazonaws.com",
      "server_side_encryption": "true",
      "max_restore_bytes_per_sec": "500mb",
      "storage_class": "intelligent_tiering",
      "use_throttle_retries": "true",
      "readonly": "false",
      "base_path": "snapshots",
      "region": "us-west-2",
      "max_snapshot_bytes_per_sec": "500mb"
    }
  }
}

Pod Env Variables:

AWS_REGION: us-west-2
AWS_ROLE_ARN: redacted
AWS_ROLE_SESSION_NAME: snapshot-repo
AWS_WEB_IDENTITY_TOKEN_FILE: /var/run/secrets/eks.amazonaws.com/serviceaccount/token

Note: snapshots authenticate with the web identity token from AWS_WEB_IDENTITY_TOKEN_FILE, which is symlinked at /usr/share/elasticsearch/config/repository-s3/aws-web-identity-token-file.
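
One way to rule out a broken link inside the pod is to check where the symlink actually resolves. The following is a minimal Python sketch, not part of the original setup: the helper name check_token_symlink is hypothetical, and the two paths are the ones listed above. It assumes Python is available wherever the check is run.

```python
import os


def check_token_symlink(link_path: str, expected_target: str) -> dict:
    """Report whether link_path is a symlink that resolves to expected_target.

    Hypothetical helper for illustration only. Both sides are resolved with
    realpath because the projected token file may itself be a symlink.
    """
    resolved = os.path.realpath(link_path)
    return {
        "is_symlink": os.path.islink(link_path),
        "resolved_path": resolved,
        "points_to_expected": resolved == os.path.realpath(expected_target),
        "readable": os.access(resolved, os.R_OK),
    }


if __name__ == "__main__":
    # Paths taken from the pod configuration described in this post.
    link = "/usr/share/elasticsearch/config/repository-s3/aws-web-identity-token-file"
    target = os.environ.get(
        "AWS_WEB_IDENTITY_TOKEN_FILE",
        "/var/run/secrets/eks.amazonaws.com/serviceaccount/token",
    )
    print(check_token_symlink(link, target))
```

If points_to_expected or readable comes back False on a node that is failing, the link itself would be worth investigating before looking at STS.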

When snapshots are working, the S3 logs on AWS show that requests use the correct IAM role. When they are failing, the logs show that requests use the cluster's IAM role instead.


{
  "error": {
    "root_cause": [
      {
        "type": "repository_verification_exception",
        "reason": "[s3backup-test] path [s3backup] is not accessible on master node"
      }
    ],
    "type": "repository_verification_exception",
    "reason": "[s3backup-test] path [s3backup] is not accessible on master node",
    "caused_by": {
      "type": "i_o_exception",
      "reason": "Unable to upload object [s3backup/tests-OZH3AX3YRyaM-uRg8ryT6A/master.dat] using a single upload",
      "caused_by": {
        "type": "amazon_s3_exception",
        "reason": "amazon_s3_exception: Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 4JG93A0MCAEMKGW7; S3 Extended Request ID: B5MTtu/eW3DB+DFr0HE93IOk0DykJkszaKS4LJDpE6ZBRYyK0v8FIa+MrDFYjZEsdAo1SIDaU1o=)"
      }
    }
  },
  "status": 500
}

AWS support has noted that Elasticsearch seems to be reaching out to the global STS endpoint rather than the regional one, even though all environment variables are set to the proper region. They would like to know whether the STS endpoint can be configured in Elasticsearch.
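
To compare against what AWS support observed, the sketch below reports the regional STS hostname that would be expected for the pod's AWS_REGION, alongside the global one. It is a hedged illustration only: describe_sts_config is a hypothetical helper, and AWS_STS_REGIONAL_ENDPOINTS is an AWS SDK setting that does not appear in the pod configuration above. Whether the repository-s3 plugin in 8.3.1 honors that variable is not confirmed here.

```python
import os

# Global STS hostname; regional hostnames in the standard AWS partition
# follow the pattern sts.<region>.amazonaws.com.
GLOBAL_STS_ENDPOINT = "sts.amazonaws.com"


def regional_sts_endpoint(region: str) -> str:
    """Build the regional STS hostname for a standard-partition region."""
    return f"sts.{region}.amazonaws.com"


def describe_sts_config(env=None) -> dict:
    """Summarize STS-related environment settings (hypothetical helper)."""
    env = os.environ if env is None else env
    region = env.get("AWS_REGION")
    # Assumption: this variable may or may not be read by Elasticsearch.
    mode = env.get("AWS_STS_REGIONAL_ENDPOINTS")
    return {
        "region": region,
        "sts_regional_endpoints": mode,
        "regional_endpoint": regional_sts_endpoint(region) if region else None,
        "global_endpoint": GLOBAL_STS_ENDPOINT,
        "regional_requested": (mode or "").lower() == "regional",
    }


if __name__ == "__main__":
    print(describe_sts_config())
```

Run inside a failing pod, this would show whether a regional endpoint is being requested at all, which can then be checked against the hostname AWS support sees in their logs.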