S3 API costs for snapshots are extraordinarily high

To store 5TB of data, we are paying about $1,200 per month in storage fees and about $10,000 per month in API calls.

Is there a way to fix this? During a snapshot we see upwards of 120k S3 API calls per minute.
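As a rough cross-check, these two figures can be tied together: convert the monthly API bill into a request count, then into minutes of snapshotting at the observed rate. This is a back-of-envelope sketch under stated assumptions. It prices every request at $0.005 per 1,000, which is the published S3 Standard price for PUT/COPY/POST/LIST requests in us-east-1. The real region is elided here, GET requests are cheaper, and prices change, so check current AWS pricing. It also assumes about 720 snapshots in a 30-day month, matching the hourly SLM schedule.

```python
# Back-of-envelope estimate only: the price below is an assumption, not billing data.
PRICE_PER_1000_REQUESTS = 0.005  # assumed USD per 1,000 PUT/LIST requests (S3 Standard, us-east-1)
SNAPSHOTS_PER_MONTH = 24 * 30    # hourly schedule over a 30-day month


def requests_from_cost(monthly_cost_usd: float, price_per_1000: float) -> float:
    """Convert a monthly request bill into an approximate request count."""
    return monthly_cost_usd / price_per_1000 * 1000


def snapshot_minutes(total_requests: float, requests_per_minute: float) -> float:
    """Minutes of snapshot activity needed to issue that many requests."""
    return total_requests / requests_per_minute


monthly_requests = requests_from_cost(10_000, PRICE_PER_1000_REQUESTS)
total_minutes = snapshot_minutes(monthly_requests, 120_000)

print(f"requests per month:           {monthly_requests:,.0f}")
print(f"requests per snapshot:        {monthly_requests / SNAPSHOTS_PER_MONTH:,.0f}")
print(f"minutes per snapshot at 120k: {total_minutes / SNAPSHOTS_PER_MONTH:.1f}")
```

Under these assumptions that is about 2 billion requests a month, roughly 2.8 million per hourly snapshot, or about 23 minutes of snapshotting at 120k calls/minute every hour. The two numbers are consistent with each other, which suggests the bill really is driven by per-snapshot request volume.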

SLM policy:

```
PUT _slm/policy/hourlies
{
  "name": "<hourly-snap-{now/d}>",
  "schedule": "0 0 * * * ?",
  "repository": "s3backup",
  "config": {
    "ignore_unavailable": true,
    "include_global_state": true,
    "feature_states": [],
    "partial": true
  },
  "retention": {
    "expire_after": "14d",
    "min_count": 1,
    "max_count": 24
  }
}
```
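The retention settings in this policy interact in a way that is easy to miss. As I read the SLM retention rules, `max_count` caps the number of retained snapshots even if they have not reached `expire_after`, and `min_count` keeps snapshots even after they expire. With an hourly schedule, `max_count: 24` means only about the last day of snapshots is kept, so `expire_after: 14d` never takes effect. Below is a simplified Python model of those rules. It is not Elasticsearch's implementation, and `retained_snapshots` is a hypothetical helper.

```python
from typing import List


def retained_snapshots(
    ages_hours: List[float],
    expire_after_hours: float,
    min_count: int,
    max_count: int,
) -> List[float]:
    """Return the ages of snapshots a simplified SLM retention pass would keep.

    Hypothetical model: snapshots are considered youngest first; the first
    `min_count` are always kept, nothing beyond `max_count` is kept, and the
    rest are kept only if they are no older than `expire_after_hours`.
    """
    kept = []
    for rank, age in enumerate(sorted(ages_hours)):  # youngest first
        if rank < min_count:
            kept.append(age)
        elif rank < max_count and age <= expire_after_hours:
            kept.append(age)
    return kept


# Hourly snapshots covering 14 days, evaluated against the posted policy.
hourly_ages = list(range(14 * 24))
kept = retained_snapshots(hourly_ages, expire_after_hours=14 * 24, min_count=1, max_count=24)
print(len(kept), "snapshots kept; oldest is", max(kept), "hours old")
```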

And the repository definition is:

```
{
  "s3backup": {
    "type": "s3",
    "uuid": "UUID",
    "settings": {
      "bucket": "bucketname",
      "endpoint": "s3.us-thing-1.amazonaws.com",
      "server_side_encryption": "true",
      "max_restore_bytes_per_sec": "500mb",
      "storage_class": "intelligent_tiering",
      "use_throttle_retries": "true",
      "readonly": "false",
      "base_path": "cluster_name",
      "region": "us-thing-1",
      "max_snapshot_bytes_per_sec": "500mb"
    }
  }
}
```

This cluster is running Elasticsearch 8.6.2:
Nodes: 24
Indices: 5076
Documents: 850M
Disk Usage: 1.4TB
Primary Shards: 15k
Replica Shards: 15k

The number of API calls a snapshot makes scales approximately with the number of shards, and 1.4TB is a tiny amount of data for such a high shard count.
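To put numbers on that, here is a quick calculation from the cluster stats above. It assumes the 1.4TB of disk usage is spread evenly over all 30k primary and replica shards, with one replica per primary. Elastic's general sizing guidance is on the order of 10GB to 50GB per shard, and the 30GB target here is picked purely for illustration. Because every index needs at least one primary, the 5,076 indices set a floor on how far the shard count can drop without merging indices.

```python
import math

# Cluster stats from the post.
DISK_BYTES = 1.4e12      # 1.4TB in decimal units
PRIMARY_SHARDS = 15_000
REPLICA_SHARDS = 15_000
INDICES = 5_076

# Illustrative target shard size (assumption, not a recommendation for this workload).
TARGET_SHARD_BYTES = 30e9


def average_shard_mb(disk_bytes: float, shards: int) -> float:
    """Average on-disk size per shard copy, in megabytes."""
    return disk_bytes / shards / 1e6


def primaries_needed(primary_bytes: float, target_bytes: float, indices: int) -> int:
    """Primaries needed at the target size, but never fewer than one per index."""
    return max(indices, math.ceil(primary_bytes / target_bytes))


avg_mb = average_shard_mb(DISK_BYTES, PRIMARY_SHARDS + REPLICA_SHARDS)
primary_bytes = DISK_BYTES / 2  # assumes exactly one replica per primary
by_size_alone = math.ceil(primary_bytes / TARGET_SHARD_BYTES)
with_current_indices = primaries_needed(primary_bytes, TARGET_SHARD_BYTES, INDICES)

print(f"average shard size: {avg_mb:.1f} MB")
print(f"primaries needed by size alone: {by_size_alone}")
print(f"primaries with one per existing index: {with_current_indices}")
print(f"reduction vs today: {1 - with_current_indices / PRIMARY_SHARDS:.0%}")
```

Shards average under 50MB each. Even without merging any indices, moving to one primary per index would cut the primary count by about two thirds. If snapshot API calls scale roughly linearly with shard count, the request bill should shrink by a similar factor, though that is only an approximation.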


I agree :smiley:

Unfortunately it's a legacy model :frowning: If there is no other option, maybe I'll look at convincing people it's worth refactoring ...

Our search rate is quite high, which may be an issue with fewer shards.

When there's up to $10k/mo of savings up for grabs, hopefully it's an easy sell :slightly_smiling_face:

The only other option I can think of is to replace S3 with something that doesn't charge per request, such as self-managed NFS or MinIO. But then you have to factor in the admin overhead.

100k searches per sec across 24 nodes doesn't sound especially high to me, although it very much depends on the details of your workload. You might even get better performance with fewer shards. Make sure to use Rally to benchmark your experiments.
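The fewer-shards point can be sketched with simple arithmetic. By default, a search runs against one copy of every shard in the indices it targets, so shard-level work per node grows with the number of shards each search touches. This sketch makes a hypothetical assumption that each search hits a single index. It takes the 100k searches per second figure and compares today's average of about 3 primaries per index (15k primaries over 5,076 indices) with 1.

```python
def shard_searches_per_node(searches_per_sec: float, shards_per_search: float, nodes: int) -> float:
    """Shard-level search requests per node per second, assuming an even spread across nodes."""
    return searches_per_sec * shards_per_search / nodes


SEARCHES_PER_SEC = 100_000
NODES = 24

# Hypothetical: each search targets exactly one index.
current = shard_searches_per_node(SEARCHES_PER_SEC, 15_000 / 5_076, NODES)
one_primary = shard_searches_per_node(SEARCHES_PER_SEC, 1, NODES)

print(f"current layout:    {current:,.0f} shard searches/s per node")
print(f"one primary/index: {one_primary:,.0f} shard searches/s per node")
```

This ignores caching, routing, and pre-filtering, so treat it as a rough indication only. Rally benchmarks on the real workload are still the way to confirm it.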

