Data fs grows forever after full snapshot is taken

Hi,
We have 3 Elasticsearch nodes running elasticsearch-8.11.1-1.x86_64 on CentOS Linux release 7.3.1611 (Core), installed on premises.
We have very busy read/write systems.

We are taking a full snapshot once a day, overwriting the existing one with the following cron command:

08 13 * * * curl -XDELETE localhost:9200/_snapshot/backup/backupfull; curl -XPUT localhost:9200/_snapshot/backup/backupfull

The snapshot creation takes 10 minutes without any error.

The snapshot is written to a separate filesystem, not the one that holds the data.

Randomly, not every time, the data filesystem starts growing after the snapshot is taken.
The index segments stay the same size, but the used space on the data filesystem keeps growing indefinitely.
If we stop and start Elasticsearch, the used space drops back to its usual level. If we close and reopen the indices, the space drops back as well.

We decided to comment out the snapshot crontab command, and we no longer see this strange behaviour.

Can you help us to find out what is going on?
Thank you,
Tina

Hi Tina & welcome!

Sounds strange indeed. Can you capture the full contents of the data fs (e.g. run ls -lR path/to/my/data) just after taking the snapshot, and then again say 20min later when it's been growing for a while. Also GET _segments at the same times.
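
A minimal sketch of how those two captures could be scripted. ES_URL, DATA_PATH and OUT_DIR are assumptions (DATA_PATH is only a guess at your path.data), and the capture function name is just for illustration:

```shell
#!/bin/sh
# Sketch only: ES_URL, DATA_PATH and OUT_DIR are assumed values, adjust them to your nodes.
ES_URL="${ES_URL:-http://localhost:9200}"
DATA_PATH="${DATA_PATH:-/var/lib/elasticsearch}"
OUT_DIR="${OUT_DIR:-/tmp/es-diag}"

# capture LABEL: save a recursive listing of the data path and the
# _segments output, both tagged with LABEL and a timestamp.
capture() {
  label="$1"
  stamp="$(date +%Y%m%dT%H%M%S)"
  mkdir -p "$OUT_DIR"
  # Full listing of the data filesystem, including file sizes
  ls -lR "$DATA_PATH" > "$OUT_DIR/ls-$label-$stamp.txt" 2>&1
  # Segment-level view of every index at (roughly) the same moment
  curl -s --max-time 30 "$ES_URL/_segments?pretty" \
    > "$OUT_DIR/segments-$label-$stamp.json"
}
```

Run `capture after-snapshot` right after the snapshot completes, then `capture later` about 20 minutes afterwards, and diff the two `ls-*` files to see which files are growing.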

This seems like a bad plan btw: don't delete your backup before creating the next one. For starters, it means there's a period of time where you have no backup at all. Also, if you take today's snapshot before deleting yesterday's, Elasticsearch will notice that most of the data hasn't changed, which should make the process much quicker.

Also I'd recommend just using SLM rather than your own cron job. SLM is much more robust, e.g. it'll handle failures properly and won't delete your last-good snapshot.
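
For reference, a rough sketch of creating such a policy through the SLM API instead of cron. The policy id daily-full, the 13:08 schedule (copied from your crontab, and interpreted as UTC by SLM) and all retention values are assumptions to adapt, not recommendations:

```shell
#!/bin/sh
# Sketch only: policy id, schedule and retention values below are assumptions.
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the policy body. The schedule uses Elasticsearch cron syntax,
# which has a leading seconds field: "0 8 13 * * ?" means 13:08:00 daily.
slm_policy_body() {
  cat <<'EOF'
{
  "schedule": "0 8 13 * * ?",
  "name": "<backupfull-{now/d}>",
  "repository": "backup",
  "config": { "indices": ["*"] },
  "retention": { "expire_after": "7d", "min_count": 3, "max_count": 10 }
}
EOF
}

# Create or update the policy on the cluster.
put_slm_policy() {
  slm_policy_body | curl -s -X PUT "$ES_URL/_slm/policy/daily-full" \
    -H 'Content-Type: application/json' --data-binary @-
}
```

Calling `put_slm_policy` once registers the policy. With `min_count` set, retention should never remove the last few successful snapshots, even if they are older than `expire_after`.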

The backup is saved to a separate device (Rubrik backup) before the deletion, so we keep it. If we take the snapshot before the deletion, it fails because a snapshot with the same name already exists.

I will collect some stats and share them with you.

Rubrik claims to have some very clever deduplication functionality, but I wonder if it really works as well as the deduplication built into ES itself. In any case it's still a lot more work (including IO, blowing your page cache, and network traffic) for ES to take a full backup of everything every day.

If you call the snapshot <backupfull_{now/d}> then ES will include today's date in its name.
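
A small sketch of what that looks like with curl, assuming the same repository name (backup) as in your cron job. The date-math name has to be URL-encoded in the request path, so `<backupfull_{now/d}>` becomes `%3Cbackupfull_%7Bnow%2Fd%7D%3E`. The function names are only illustrative:

```shell
#!/bin/sh
# Sketch only: ES_URL and REPO are assumed values.
ES_URL="${ES_URL:-http://localhost:9200}"
REPO="${REPO:-backup}"

# URL-encoded form of <backupfull_{now/d}>
ENCODED_NAME='%3Cbackupfull_%7Bnow%2Fd%7D%3E'

# Build the snapshot URL; Elasticsearch resolves the date math on its side.
snapshot_url() {
  printf '%s/_snapshot/%s/%s' "$ES_URL" "$REPO" "$ENCODED_NAME"
}

# Take the snapshot and wait until it has finished.
create_dated_snapshot() {
  curl -s -X PUT "$(snapshot_url)?wait_for_completion=true"
}
```

If you put the curl command directly in a crontab line, remember that `%` is special there and has to be written as `\%`, or keep the command in a script like this one and call the script from cron.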

That is a really good point. We decided to follow your suggestions and use SLM.
We scheduled a backup through Kibana.
We will keep an eye on it; hopefully we won't see any side effects.
