@DavidTurner I believe this is true at the segment level and not the document level? If you have an index that is receiving a heavy indexing load, could documents not end up in multiple segments snapshotted at different times as merging continuously occurs, even if there are no updates to documents?
Yes that's true Christian, any segments (and therefore shards and indices) that don't change between snapshots take up no extra space, but merging will cause some amount of write amplification that gets more noticeable at higher snapshot frequencies.
Okay, but as snapshots also impact normal query speed to some extent, we would like to take snapshots only at night, when load is lower. So, considering daily snapshots, can we take 2x as the upper limit for our S3 bucket when estimating cost?
any segments (and therefore shards and indices) that don't change between snapshots take up no extra space -> this will not affect our case much, as our indices are mostly unchanged over a 24-hour interval.
It Depends™ on the exact pattern of writes, but yeah, typically I'd expect daily snapshots to see a write amplification factor lower than 2x.
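As a rough illustration of the 2x figure, here is a minimal sketch of the arithmetic, assuming the repository never holds more than `amplification` times the primary data size. The function names are made up for this example, the 150TB figure is the data size mentioned later in the thread, and the price per TB-month is a placeholder you would replace with your own S3 pricing. Measuring on the dev system will give a better number than this bound.

```python
def repo_size_upper_bound_tb(primary_data_tb: float, amplification: float = 2.0) -> float:
    """Upper bound on repository size in TB.

    Assumes each byte of primary data is stored at most `amplification`
    times in the snapshot repository (2.0 is the rough ceiling discussed
    above for daily snapshots, not a guaranteed limit).
    """
    if primary_data_tb < 0:
        raise ValueError("primary_data_tb must be non-negative")
    if amplification < 1:
        raise ValueError("amplification must be at least 1")
    return primary_data_tb * amplification


def monthly_storage_cost(size_tb: float, price_per_tb_month: float) -> float:
    """Storage cost per month; price_per_tb_month is supplied by the caller."""
    return size_tb * price_per_tb_month


if __name__ == "__main__":
    bound_tb = repo_size_upper_bound_tb(150.0)  # 150TB of primary data
    placeholder_price = 10.0  # hypothetical value, NOT a real S3 price
    print(f"upper bound: {bound_tb} TB")
    print(f"monthly cost at placeholder price: {monthly_storage_cost(bound_tb, placeholder_price)}")
```

This ignores retention (how many old snapshots you keep) and request costs, both of which you would need to account for separately.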
Note that daily snapshots are not quite sufficient to achieve an RPO of 1 day, because creating the snapshot itself takes time. I would recommend snapshotting at least twice per target RPO period, i.e. no longer than 12h between snapshots in your case.
With your dev system, and now that you've figured out the S3 versioning thing, you don't really need to make back-of-an-envelope style estimates.
Just use the dev system (which I assume is fairly representative) to actually calculate for your own data flows.
I agree with David on taking snapshots more than once per day. Let's say disaster hits you at 23:00 (and why wouldn't it?); then if your last snapshot was at around midnight, you've lost almost an entire day's worth of data straight away, even if you are able to "recover" the rest in a timely way. And I have no idea of your infrastructure, but a recovery of 150TB of data onto the production system from S3 is likely going to take a while, not even counting the time it will take to (under extreme stress) bring up your new cluster.
One point, from someone who has worked in operations roles a lot of the last 30+ years:
because the amount of data that comes in is quite huge we feel 30 mins will not be sufficient to complete the backup and we may have 2 concurrent backups running
Concurrent snapshots are ok, there's no need to settle for a worse RPO just to avoid them
This isn't a technical point, but generally the mindset of Operations is that BAU has a certain rhythm, and one aspect is that backups are not running during busy times, and certainly not overlapping. So snap@1230 starting before snap@1200 finishes gets people really nervous, even if it shouldn't, and even though it's technically fine.
e.g. @Dharani_Vattamwar wrote:
snapshots also impact normal query speed to some extent, we would like to take snapshots only at night, when load is lower
You see!
I wonder if taking a snapshot really does noticeably impact search speed? I.e., has that actually been measured here? But, whatever the answer, there is that lurking implied suspicion that it might.
Well yeah but as the OP mentioned, their RPO is 1 day so they're ok with this.
The bigger problem is that if you schedule snapshots at 00:00 and they typically take 45 minutes, then when disaster strikes at 00:30 your ongoing snapshot will fail, but the previous snapshot started more than 24h ago, so you fail your RPO.
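The arithmetic behind this can be sketched as follows. It is a simplified model, assuming a snapshot captures roughly the state at its start time and is only usable once it has completed, so the worst case is a disaster just before an in-flight snapshot finishes. The function names are made up for this example; the 45-minute duration and 1-day RPO are the figures from the thread.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, duration: timedelta) -> timedelta:
    """Worst-case age of the newest usable snapshot.

    If disaster strikes just before the current snapshot completes, the
    last completed snapshot started one full interval before the current
    one, and the current one has been running for almost `duration`.
    """
    return interval + duration


def meets_rpo(interval: timedelta, duration: timedelta, rpo: timedelta) -> bool:
    """True if the worst-case data loss stays within the target RPO."""
    return worst_case_data_loss(interval, duration) <= rpo


if __name__ == "__main__":
    rpo = timedelta(days=1)
    duration = timedelta(minutes=45)
    for hours in (24, 12):
        interval = timedelta(hours=hours)
        print(f"every {hours}h:",
              worst_case_data_loss(interval, duration),
              "OK" if meets_rpo(interval, duration, rpo) else "misses RPO")
```

Under this model a 24h schedule misses a 1-day RPO by the snapshot duration, while the 12h interval recommended earlier in the thread leaves plenty of headroom.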
Sure, if you want to impose extra constraints then that's up to you, as long as you're aware that they're your constraints rather than anything ES is forcing upon you.