S3 bucket holding the snapshots is 2-2.5x the total disk used by the cluster

Hi,

I have a cluster (ES 7.11.1) with 6 nodes, and the total disk used is roughly 4.5TB. I store logging data in it using data streams. My data streams are named like abc-lmn-xyz. The data retention period is 14 days.

I have an SLM policy that runs every hour and is configured to pick up indices matching the pattern *abc*. The expiry set in S3 for snapshots is 15 days.

{
  "name": "<abc-hourly-{now{yyyy-MM-dd't'HH:mm:ss.SSS'z'}}>",
  "schedule": "0 0 * * * ?",
  "repository": "s3_repository_ds",
  "config": {
    "indices": [
      "*abc*"
    ],
    "ignore_unavailable": true
  },
  "retention": {
    "expire_after": "15d"
  }
}

The total size of data in S3 was found to be ~11TB even though my total disk used is ~4.5TB.

Is this S3 storage size expected, or is my policy misconfigured?

That doesn't seem totally unreasonable to me. Data in snapshots is deduplicated where possible, but if you take a snapshot, do some more indexing, and take another snapshot, then it's possible that all the files in the shard are different (so no deduplication is possible), which would take up double the storage.

Do you really need to retain every hourly snapshot for the full 15 days? You could retain hourlies for, say, 2 days and then keep just 12-hourly ones for the remainder of the time.

Sorry, I am a bit confused here. When you say to store hourly snapshots for 2 days and 12-hourly ones for 15 days, what value does that add? Won't the total size of the snapshots be the same? Am I missing something here?

I understand that the snapshots are incremental, so whether they are hourly or 12-hourly, the total storage is going to be the same.

Snapshots are not incremental, at least not at the document level. Each snapshot contains the full set of data, but segments are reused if they have already been copied and are unchanged. The API keeps track of which segments are in use by which snapshots. When merging occurs, data is copied over to a new segment, which means the repository can hold the same data multiple times, and that explains the larger size.
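
To make the effect of merging concrete, here is a rough toy model in Python. It is only a sketch under made-up assumptions, not how Elasticsearch actually schedules merges: one shard, one new 1-unit segment per hour, and every `merge_every` hours all live segments are rewritten into a single merged segment. The function name `simulate` and all of its parameters are hypothetical. The repository is modelled as storing each segment file once, no matter how many retained snapshots reference it.

```python
def simulate(hours, merge_every, snapshot_every=1, retain=None):
    """Return (live_size, repo_size) for a toy single-shard model.

    All numbers are illustrative assumptions, not Elasticsearch behaviour:
    one 1-unit segment is written per hour, and every `merge_every` hours
    all live segments are merged into one new segment.
    """
    live = {}        # segment name -> size, as currently on disk
    snapshots = []   # each snapshot records the segments it references

    for hour in range(1, hours + 1):
        live[f"seg-{hour}"] = 1
        if hour % merge_every == 0:
            # A merge copies the same data into a brand-new segment file.
            live = {f"merged-{hour}": sum(live.values())}
        if hour % snapshot_every == 0:
            snapshots.append(dict(live))
            if retain:
                # Keep only the most recent `retain` snapshots.
                snapshots = snapshots[-retain:]

    # The repository keeps each referenced segment exactly once.
    repo = {}
    for snap in snapshots:
        repo.update(snap)
    return sum(live.values()), sum(repo.values())


if __name__ == "__main__":
    print(simulate(8, 4))                    # hourly snapshots
    print(simulate(8, 4, snapshot_every=4))  # one snapshot every 4 hours
```

In this model, 8 hours of hourly snapshots with a merge every 4 hours give a live size of 8 units but a repository size of 18 units, because the repository holds both the small pre-merge segments and the merged copies. Snapshotting every 4 hours instead gives 12 units, since the short-lived small segments are never captured. That is why retaining fewer, less frequent snapshots can shrink the repository even though each snapshot still covers the full data set.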