I need to put together a DR plan for our Elastic system. I have already tested the snapshot restore process, and it works. However, my process is the following:
Adjust cluster settings to set action.destructive_requires_name to "false"
Stop the Kibana pods, since the operation targets all indexes (*)
Close all indexes via curl
Restore snapshot via curl
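Roughly, the curl calls look like this (the repository and snapshot names here are just placeholders):

```
# 1. allow wildcard/destructive actions
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{ "persistent": { "action.destructive_requires_name": false } }'

# 2. (Kibana pods are stopped outside of Elasticsearch)

# 3. close all indexes
curl -X POST "localhost:9200/_all/_close"

# 4. restore everything from the snapshot
curl -X POST "localhost:9200/_snapshot/my_repo/snap_1/_restore" -H 'Content-Type: application/json' -d'
{ "indices": "*" }'
```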
This process works... but I have only tested it once all the snapshots are restored. The problem is we have way too much data in production for this to be practical. I need a way for indexes to be written to while old ones are restored. How can I accomplish this when all the indexes are closed?
I think what I need to do is roll over data streams and other indexes to new names, close all indexes except the rolled-over ones, and restore only those closed indexes, which leaves the rolled-over ones available to write to. Is this right? Note I will also need a way for our frontend to still interact with the API to gather this data; I think this is enabled by default. Is there an easier way, or is this the only way?
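For the rollover step I'm picturing something like this for each data stream / write alias (the names here are made up):

```
# roll a data stream over to a new backing index
curl -X POST "localhost:9200/my-data-stream/_rollover"

# roll a write alias over to a new index
curl -X POST "localhost:9200/my-write-alias/_rollover"
```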
Data streams can only write to their current write index. You can't write to a previous backing index; it is by design.
You can't write to a rolled-over index if you are using a regular template/ILM/alias either, because once you roll over to a new index the older index becomes read-only and the alias points to the current index, which is the writable one.
When you restore an index it is not open until it is fully restored and initialized. This means you can't even read from it until it is ready.
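You can see which index is the writable one, for example (alias and data stream names are just examples):

```
# is_write_index: true marks the only index the alias will write to
curl -X GET "localhost:9200/_alias/my-write-alias?pretty"

# for a data stream, the last backing index listed is the write index
curl -X GET "localhost:9200/_data_stream/my-data-stream?pretty"
```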
Thanks for the reply.
To confirm, is it not possible to write to indexes or data streams while a restore from snapshot is happening? If so, how is that practical? Some of these restores could take weeks if not months to complete. I can't have that kind of downtime; what options do I have that don't require licensing?
In any restore, even a simple file copy from one machine to another, it is not available until the full restore is complete. Name one product that will allow you to update the restoring copy before it is fully restored.
Now let's say you started restoring a document. It is halfway through writing the address field, and you want to go change fname: not possible, because the restore process has to write the whole document back and send the signal "I am done with the restore".
Right. This makes sense.
What if I did two different snapshots? One that backs up the cluster state (all index names, data stream names, Kibana, other internal indexes).
The other would be the actual data for the indexes and data streams.
Then I restore the cluster state snapshot first, roll over the indexes and data streams, close the older indexes and data streams, and then restore the second snapshot? I have no idea how to do this btw, just wondering if something like this is possible.
If not, I am curious how a restore is ever practical. What if a company has petabytes of data to restore? You are telling me that the company needs to restore ALL of those petabytes for all indexes and data streams before anything can be readable/writeable again? That could take a long time. The cluster is just dead in the water until a restore is complete?
If you have petabytes of data that need to be restored, then you structure your indexes for it.
For example, one index a month.
That gives you 12 indexes a year, for example:
my_index-YYYY-MM
Now you back up each index separately.
Then, when it is time to restore, you restore the current month's index ("my_index-2024-12") and start writing data (the restore will be fast since there is less data in it),
and then start the restore of all the other indexes, which do not need any data changes.
But for sure you can't write data to an index while it is being restored.
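As a rough example (repository and snapshot names are just placeholders):

```
# restore this month's (small) index first so you can start writing again quickly
curl -X POST "localhost:9200/_snapshot/my_repo/snap-2024-12/_restore?wait_for_completion=true" -H 'Content-Type: application/json' -d'
{ "indices": "my_index-2024-12" }'

# then kick off the older, larger indexes in the background
curl -X POST "localhost:9200/_snapshot/my_repo/snap-2024-11/_restore" -H 'Content-Type: application/json' -d'
{ "indices": "my_index-2024-11" }'
```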
Isn't there a way I can select specific indexes to restore from a snapshot? That way I don't need to make different snapshots for different indices.
There are a ton of DR plans floating around; most of them are not practical in a lot of D cases, but everyone just has to pretend they are because, well, you know.
Given what you wrote, I'd try partitioning your data into "recent" (a small subset) and "old" (most of it) and have a different DR plan/approach for each. I've no idea of your use case, but say the last 7 days of data is defined as recent and the rest, say the last 12 months, is "old". In that case recent is just ~2% of the entire data. In case of D, you try to restore a close-to-BAU service by looking at recent only first, say aiming for an RTO of an hour, or 12 hours, or whatever, and then over a longer period of time the rest of the data is backfilled.
As Sachin has pointed out, this whole idea is predicated on the data being some sort of time-series, and organised/indexed appropriately. Or otherwise partition-able and partitioned. You wrote:
"The cluster is just dead in the water until a restore is complete?".
No, the restored cluster could be quickly usable, just certain indices are not usable until that specific index's restore is complete.
Some of these restores could take weeks if not months to complete.
Then the business owner has implicitly accepted that weeks to months for full restoration of service, including access to all restored data, is acceptable. It's really that simple. Those of us old enough to have dealt with WORM tape backups in offsite locations have seen these movie scripts before.
Not sure I understand how this is possible in an automatic way. If I tell Elastic to restore my snapshot, all indexes, it's not clear to me in what order the snapshot indexes are restored. Let's just say the snapshot is restored from newest to oldest for the sake of discussion. This would require me to monitor the state of the restore, and make sure the specific indexes that I need restored are restored, and the indexes are open again. It's also not clear to me if an index opens automatically after IT is restored, or once all indexes in the snapshot are restored. I suspect that I would need to manually open the indexes after they are restored too. To me it seems easier to restore the specific snapshot indexes I want to be restored first, start writing to them, then let the rest restore. Is that not valid?
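For the monitoring part, I'm guessing I'd have to poll something like this per index (the index name is a placeholder):

```
# shows shards still being restored for a given index
curl -s "localhost:9200/_cat/recovery/my_index-2024-12?v&active_only=true"

# or block until the index is at least yellow (all primaries allocated)
curl -s "localhost:9200/_cluster/health/my_index-2024-12?wait_for_status=yellow&timeout=30m"
```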
Then the business owner has implicitly accepted that weeks to months for full restoration of service, including access to all restored data, is acceptable. It's really that simple. Those of us old enough to have dealt with WORM tape backups in offsite locations have seen these movie scripts before.
Yes, from a business perspective, this makes sense. But from what I have been testing, I think it's possible to restore some of the data, write to it, then restore the rest. Was just wondering if others used the same process, and what that would be.
This would require me to monitor the state of the restore, and make sure the specific indexes that I need restored are restored
Is there a universe where you would not expect someone / something to be doing this? "Trust, but verify" is great advice.
It's also not clear to me if an index opens automatically after IT is restored
So, you haven't tested restoring even a single index from a snapshot? Later you write "... from what I have been testing ...", so I think you do know the answer here, or at least should.
Not sure I understand how this is possible in an automatic way
Completely automated? From detecting the D, and knowing when the R can begin, to deciding what to do when, prioritizing things, validating whatever caused the outage is truly resolved, etc? Don't set your bar too high here.
I think you are conflating 2 things here, which btw we all do:
The technical side of how Elasticsearch manages/uses snapshots
The business ramifications, and the need for "plans" that look like they are akin to turning a handle.
Obviously you know how your data is named/structured, so you can script restores to run automatically in whatever order you see fit (rough sketch below). You can choose what to do to validate that an index is now "good". So
To me it seems easier to restore the specific snapshot indexes I want to be restored first, start writing to them, then let the rest restore. Is that not valid?
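As a very rough sketch (repo, snapshot, index names and the validation step are all placeholders, adapt to taste):

```
#!/usr/bin/env bash
# restore indexes one by one, in priority order, and sanity-check each
REPO=my_repo
SNAP=snap_1
ES=localhost:9200

for IDX in my_index-2024-12 my_index-2024-11 my_index-2024-10; do
  # kick off the restore of just this index
  curl -s -X POST "$ES/_snapshot/$REPO/$SNAP/_restore" \
    -H 'Content-Type: application/json' \
    -d "{\"indices\": \"$IDX\"}"

  # wait until its primaries are allocated before moving on
  curl -s "$ES/_cluster/health/$IDX?wait_for_status=yellow&timeout=30m" > /dev/null

  # whatever "good" means to you, e.g. a doc count sanity check
  curl -s "$ES/$IDX/_count?pretty"
done
```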
@RainTown Thanks for all your input on this. I think my process is valid; however, I have one piece that is still an issue for me: trying to figure out why restoring single indexes that are part of a data stream does not restore them as backing indexes. The documentation says a snapshot restore of a data stream restores its backing indexes:
If you restore a data stream, you also restore its backing indices.
But it also says it doesn't if it's single indexes?
You can restore only a specific backing index from a data stream. However, the restore operation doesn’t add the restored backing index to any existing data stream.
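If I'm reading that right, it sounds like I'd have to re-attach a restored backing index to the data stream myself, maybe with something like this (untested, names made up):

```
curl -X POST "localhost:9200/_data_stream/_modify" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add_backing_index": { "data_stream": "my-data-stream", "index": ".ds-my-data-stream-2024.12.01-000001" } }
  ]
}'
```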
I can only refer to what Sachin wrote, which admittedly I cannot confirm myself - Trust, but verify! But I have no reason to believe it is not correct.
Data streams can only write to their current write index. You can't write to a previous backing index; it is by design.