S3 bucket size holding the snapshots is 2-2.5x of the total disk used of the cluster

MitParekh · July 16, 2021, 8:09pm

Hi,

I have a cluster (ES 7.11.1) with 6 nodes, and the total disk used is roughly 4.5TB. I store logging data in it using data streams. My data stream names follow a pattern like abc-lmn-xyz. The data retention period is 14 days.

I have an SLM policy that runs every hour, configured to pick up indices matching the pattern *abc*. The expiry set for snapshots in S3 is 15 days.

{
  "name": "<abc-hourly-{now{yyyy-MM-dd't'HH:mm:ss.SSS'z'}}>",
  "schedule": "0 0 * * * ?",
  "repository": "s3_repository_ds",
  "config": {
    "indices": [
      "*abc*"
    ],
    "ignore_unavailable": true
  },
  "retention": {
    "expire_after": "15d"
  }
}

The total size of the data in S3 was found to be ~11TB, even though the total disk used by the cluster is ~4.5TB.

Is this S3 storage size expected, or is my policy misconfigured?

DavidTurner · July 17, 2021, 7:28am

Doesn't seem totally unreasonable to me. Data in snapshots is deduplicated where possible, but if you take a snapshot, do some more indexing and take another snapshot then it's possible that all the files in the shard are different (no deduplication is possible) which would take up double the storage size.

Do you really need to retain every hourly snapshot for the full 15 days? You could retain hourlies for 2 days say and then just 12-hourly ones for the remainder of the time.
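To put rough numbers on that suggestion, here is a small sketch that counts how many snapshots each scheme keeps around. The helper `snapshots_retained` and the tier tuples are made up for illustration, not part of any Elasticsearch API, and fewer retained snapshots will not necessarily shrink the repository in the same proportion.

```python
def snapshots_retained(tiers):
    """Count retained snapshots for a list of (interval_hours, window_hours) tiers.

    Hypothetical helper for illustration only: each tier keeps one snapshot
    per interval over its own retention window.
    """
    return sum(window // interval for interval, window in tiers)


# Current policy: one snapshot per hour, all kept for 15 days.
hourly_only = snapshots_retained([(1, 15 * 24)])

# Suggested scheme: hourlies for 2 days, then 12-hourly for the remaining 13 days.
tiered = snapshots_retained([(1, 2 * 24), (12, 13 * 24)])

print(hourly_only, tiered)
```

Each retained snapshot can pin segment files that the live index has since merged away, so keeping fewer snapshots generally means fewer old files have to stay in the bucket.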

MitParekh · July 17, 2021, 7:29pm

Sorry, I am a bit confused here. When you say to store hourly snapshots for 2 days and 12-hourly ones for 15 days, what value does that add? Won't the size of both sets of snapshots be the same? Am I missing something here?

I understand that the snapshots are incremental so be it hourly or 12 hourly, the total storage is going to be the same.

Christian_Dahlqvist · July 17, 2021, 8:01pm

Snapshots are not incremental, at least not at the document level. Each snapshot contains the full set of data, but segments are reused if they have already been copied and are unchanged. The API keeps track of which segments are in use by which snapshots. When merging occurs, data is copied over to a new segment, which means the repository can hold the same data multiple times, and that explains the larger size.
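A toy model may make the effect easier to see. This is only a sketch under simplifying assumptions, not how Elasticsearch actually schedules merges: every hour adds one segment of size 1, every `merge_every` hours all live segments are rewritten into a single new segment, and a snapshot simply references whichever segments are live at that moment. The function and parameter names are invented for this example.

```python
def simulate(hours, retain_snapshots, merge_every):
    """Return (live_index_size, repository_size) for a toy segment model."""
    sizes = {}       # segment id -> size
    live = set()     # segment ids currently making up the index
    snapshots = []   # each snapshot is the set of segment ids it references
    next_id = 0

    for hour in range(hours):
        # Indexing creates one new segment per hour.
        sizes[next_id] = 1
        live.add(next_id)
        next_id += 1

        # A merge copies all live data into a brand-new segment.
        if (hour + 1) % merge_every == 0:
            sizes[next_id] = sum(sizes[s] for s in live)
            live = {next_id}
            next_id += 1

        # Take the hourly snapshot, then expire the oldest ones.
        snapshots.append(frozenset(live))
        snapshots = snapshots[-retain_snapshots:]

    # The repository must keep every segment that any retained snapshot references.
    referenced = set().union(*snapshots)
    repo_size = sum(sizes[s] for s in referenced)
    live_size = sum(sizes[s] for s in live)
    return live_size, repo_size


print(simulate(hours=6, retain_snapshots=6, merge_every=3))
print(simulate(hours=6, retain_snapshots=1, merge_every=3))
```

In this model, retaining all six snapshots leaves the repository holding more than twice the live index size, because the pre-merge segments are still referenced by older snapshots. Retaining only the latest snapshot brings the two sizes back in line.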

