Backup repository size is much bigger than indices size

I take an Elasticsearch snapshot every hour. After about 370 snapshots (about 15 days), my backup repository is more than 15G, but the total size of the indices is only about 500M. Elasticsearch snapshots are incremental, so a difference of 15G vs 500M seems huge. Is it normal for the backup repository to be so much larger than the indices?
Is it caused by my frequent (hourly) backups? I take an hourly backup on cluster 1 and do an hourly restore on cluster 2 to keep the data of the two ES clusters in sync in near real time.

======
My Elasticsearch settings: 2 nodes, 12 shards/node, 2 indices, fs-type repository storing snapshots on a NAS.

In the Elasticsearch data directory, the indices sizes are:

node1 indices size

[root@esnode1 indices]$ du -sh
266M .

node2 indices size

[root@esnode2 indices]$ du -sh
238M .

In the backup repository, the sizes are:

[root@esnode1 backup]$ du -lh
114M ./backup/indices/index1/0
112M ./backup/indices/index1/5
114M ./backup/indices/index1/11
114M ./backup/indices/index1/10
111M ./backup/indices/index1/8
116M ./backup/indices/index1/4
120M ./backup/indices/index1/9
118M ./backup/indices/index1/3
114M ./backup/indices/index1/2
115M ./backup/indices/index1/7
115M ./backup/indices/index1/1
112M ./backup/indices/index1/6
1.4G ./backup/indices/index1
747M ./backup/indices/index2/0
1.6G ./backup/indices/index2/5
887M ./backup/indices/index2/11
743M ./backup/indices/index2/10
2.1G ./backup/indices/index2/8
801M ./backup/indices/index2/4
1.3G ./backup/indices/index2/9
878M ./backup/indices/index2/3
951M ./backup/indices/index2/2
1.2G ./backup/indices/index2/7
953M ./backup/indices/index2/1
943M ./backup/indices/index2/6
13G ./backup/indices/index2
15G ./backup/indices
15G ./backup
1.1M ./backuplogs
15G .

Snapshot and restore works at the segment level and is incremental in that it will only copy a segment once, even if that segment is used in multiple snapshots. This is described quite well in this blog post. As segments merge, the new merged segments will also be backed up, and since multiple segments in the repository then hold the same records, snapshotting is not incremental at the record level.
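
To make the segment-level behaviour concrete, here is a toy simulation. It is only a sketch under simplifying assumptions: one new segment is flushed per hour, every `merge_every` hours all live segments are merged into a single new one, and a snapshot after each hour copies only the segments the repository does not hold yet. Real Lucene merge policies are tiered and more selective, and the function and parameter names are made up for illustration.

```python
def simulate(hours, merge_every, docs_per_hour=1):
    """Toy model of hourly snapshots with periodic full merges.

    Returns (index_size, repo_size) in arbitrary "document" units.
    No snapshot is ever deleted, so the repository keeps every segment
    that was live at the time of any snapshot.
    """
    live = []      # (segment_id, size) pairs currently making up the index
    repo = {}      # segment_id -> size, every segment ever copied to the repo
    next_id = 0
    for hour in range(1, hours + 1):
        # Indexing during this hour produces one new segment.
        live.append((next_id, docs_per_hour))
        next_id += 1
        # Assumption: a background merge rewrites all live segments into one.
        if hour % merge_every == 0:
            merged_size = sum(size for _, size in live)
            live = [(next_id, merged_size)]
            next_id += 1
        # Hourly snapshot: copy only segments not already in the repository.
        for seg_id, size in live:
            repo.setdefault(seg_id, size)
    index_size = sum(size for _, size in live)
    return index_size, sum(repo.values())
```

With these assumptions, `simulate(24, 4)` returns `(24, 102)`: after one day the index holds 24 units, while the repository holds 102 because each merged segment was stored again in full.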

Yes, I found and read the blog post you linked in another topic before posting this one. My case does not involve segment merges ...
I'm just curious why the backup repository is so huge (15G) compared with the index data (500M). It grows by almost 1G every day, so it will eat up my NAS soon.

Since the backup is incremental, my backup size should be similar to the indices size, or at most double it, since there are some snapshot record files.

If you index into Elasticsearch, segments will automatically be merged in the background.

Ah, I did not realize that segment merges happen automatically whenever there is indexing into ES.
Given that, I tried following the "Merge" part of the linked blog post: after a merge, when I back up again, the repository size more than doubles. I am not sure how often merges happen when indexing into ES. If every indexing operation triggered a merge, the repository would certainly become very large with each backup. So a large backup repository is expected, right?
I am using backup and restore to sync data between our production system and our disaster recovery system. Backup/restore happens hourly, which means I have to delete old snapshots through the API periodically, right?

Deleting old snapshots will remove segments that no remaining snapshot refers to, and will reduce storage space. How many old snapshots do you need to keep? How far back in time do you need to be able to restore?
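
As a rough illustration of such a cleanup, the sketch below lists the snapshots in a repository and deletes those older than a retention window, using the snapshot REST endpoints (`GET /_snapshot/<repo>/_all` and `DELETE /_snapshot/<repo>/<snapshot>`). It assumes an unauthenticated cluster and a response containing a `snapshots` array with `snapshot` and `start_time_in_millis` fields; check the response format for your Elasticsearch version. The function names are hypothetical.

```python
import json
import time
import urllib.request


def snapshots_to_delete(snapshots, keep_days, now_millis):
    """Return the names of snapshots that started more than keep_days ago."""
    cutoff = now_millis - keep_days * 24 * 60 * 60 * 1000
    return [s["snapshot"] for s in snapshots
            if s["start_time_in_millis"] < cutoff]


def prune(base_url, repo, keep_days):
    """List all snapshots in `repo` and delete the ones past retention."""
    with urllib.request.urlopen(f"{base_url}/_snapshot/{repo}/_all") as resp:
        # Assumed response shape: {"snapshots": [{"snapshot": ..., ...}]}
        snapshots = json.load(resp)["snapshots"]
    now_millis = int(time.time() * 1000)
    for name in snapshots_to_delete(snapshots, keep_days, now_millis):
        req = urllib.request.Request(
            f"{base_url}/_snapshot/{repo}/{name}", method="DELETE")
        urllib.request.urlopen(req).close()
```

A call such as `prune("http://localhost:9200", "my_backup", 60)` (URL and repository name are placeholders) could then be run from a daily cron job.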

I think two months is enough from a business perspective, so I can delete the old snapshots every two months.
Just want to confirm with you again:

  1. It is normal for the indices and the backup repository to differ so much in size (500M vs 15G) in my case, right?
  2. Some of the redundant data in the snapshots is caused by Lucene segment merges, right?
    Thanks!

Yes, that is correct. If you are constantly indexing into the cluster, segment merges will happen continuously in the background and the same record will end up in multiple segments over time, resulting in a repository that is considerably larger than the index size.
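
For capacity planning on the NAS, a back-of-the-envelope projection from the figures in this thread may help. This is only a sketch: it assumes the repository keeps growing linearly at the observed rate and that deleting snapshots older than the retention window frees roughly what those snapshots added, neither of which is guaranteed. The function name is made up for illustration.

```python
def projected_repo_size_gb(observed_gb, observed_days, retention_days):
    """Linear estimate of repository size once retention is in steady state."""
    daily_growth_gb = observed_gb / observed_days
    return daily_growth_gb * retention_days


# Figures from this thread: about 15G after about 15 days, two months kept.
estimate = projected_repo_size_gb(15, 15, 60)
print(f"approx. {estimate:.0f}G with a 60-day retention")
```

Under those assumptions, a two-month retention would need on the order of 60G, so the NAS should be sized with that in mind.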
