What is the status of resiliency with the latest ES version?

I found this, but didn't figure much out.

How safe is it (in terms of data loss) to use ES 6.1.2 as primary data store?

Hi,

It's hard to say how safe something is. We try to be very open about bugs / problems we know about so people can make an informed decision. The resiliency page you linked is the main vehicle for that. As you can see a ton of work went into this area. ES 6.x is in a totally different universe than 1.x when we started the project. That said, there are still issues we are working to fix. Those typically require multiple cascading failures to occur.

Perhaps it will help you the most if I share how I would approach this if I were designing an Elasticsearch-based system. The first premise is that to prevent data loss you always need to store the data in multiple places. No single place is 100% safe. All software has bugs, and even if it didn't, human operators will make mistakes that may end up in data deletion or corruption (in the sense that the wrong thing was indexed).

This means it all boils down to three questions: how often do you expect to lose data, how fast is recovery after data loss, and how much are you willing to invest to minimize the impact if it happens?

Let me unpack it via a few examples. If you store logs for debugging purposes and you are not required to keep them for any auditing requirement, maybe it doesn't make sense for you to invest in the storage costs and operational skills needed to maintain a second copy of the data outside of Elasticsearch.

If you have time-based data and you can afford some limited downtime for recent data, maybe it's enough to store raw JSON in some cheap storage like S3. On the rare occasion a problem occurs, you can reindex starting from the most recent data, which then becomes available again quickly.
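To make the "reindex starting from the most recent data" idea concrete, here is a minimal sketch. It assumes the raw JSON is archived one file per day under keys like `logs/2018-01-31.json`; that layout, the helper names, and the `doc` type name are illustrative assumptions, not something from this thread. It only orders the archive newest first and builds the newline-delimited body that the `_bulk` endpoint expects; fetching from storage and sending the request are left out.

```python
import json
from datetime import datetime


def newest_first(archive_keys):
    """Order daily archive keys so the most recent day is reindexed first.

    Assumes (hypothetically) keys shaped like 'logs/2018-01-31.json'.
    """
    def day_of(key):
        stem = key.rsplit("/", 1)[-1].split(".", 1)[0]
        return datetime.strptime(stem, "%Y-%m-%d")

    return sorted(archive_keys, key=day_of, reverse=True)


def bulk_body(index, docs, doc_type="doc"):
    """Build an NDJSON body for the _bulk endpoint from raw JSON documents.

    Each document is preceded by an 'index' action line, and the body
    must end with a newline. The type name 'doc' is an assumption.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"
```

With this ordering, yesterday's and today's data are searchable again long before the older archive has finished loading.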

If you can afford some downtime on all data but not too much, you can rely on the Snapshot/Restore API of Elasticsearch. If something happens, you restore your data from the backup without having to reindex.
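As a rough illustration of what relying on Snapshot/Restore involves, the sketch below builds the two HTTP requests: one to take a snapshot and one to restore from it. It assumes a snapshot repository has already been registered; the repository, snapshot, and index names are placeholders, and the helper functions are hypothetical. The requests are only constructed, not sent, so you can plug them into whatever HTTP client you use.

```python
import json


def snapshot_request(repo, snapshot, indices):
    """Request that snapshots the given indices into a registered repository."""
    path = "/_snapshot/{}/{}?wait_for_completion=true".format(repo, snapshot)
    body = {"indices": ",".join(indices), "include_global_state": False}
    return "PUT", path, json.dumps(body)


def restore_request(repo, snapshot, indices):
    """Request that restores the given indices from an existing snapshot."""
    path = "/_snapshot/{}/{}/_restore".format(repo, snapshot)
    body = {"indices": ",".join(indices), "include_global_state": False}
    return "POST", path, json.dumps(body)
```

For example, `snapshot_request("my_backup", "snap_1", ["orders"])` yields a `PUT` to `/_snapshot/my_backup/snap_1?wait_for_completion=true`. How often you snapshot determines how much recent data you may need to re-send after a restore.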

If copying data takes too long, but you don't want the complexity of replicating your writes, you can have two separate ES clusters and use snapshot and restore to transfer data from one to the other (restoring into closed indices). Now you're investing more, but your backup data is available behind a single API call (open the indices).
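The "single API call" on the standby cluster is the open index API. The sketch below builds that request, plus the matching close request used to keep the standby copies closed. The exact close/restore cycle on the second cluster is not spelled out here, so only the two index state calls are shown; the helper names and index names are hypothetical.

```python
def _index_state_request(indices, action):
    """Build a POST to /<comma-separated indices>/<action>."""
    if not indices:
        raise ValueError("at least one index name is required")
    return "POST", "/" + ",".join(indices) + "/" + action


def close_indices_request(indices):
    """Request that closes the standby indices so they can be restored into."""
    return _index_state_request(indices, "_close")


def open_indices_request(indices):
    """Failover step: request that opens the standby indices for traffic."""
    return _index_state_request(indices, "_open")
```

On failover you would send the result of `open_indices_request(["orders", "users"])`, that is, `POST /orders,users/_open`, to the standby cluster and point clients at it.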

I hope this helps you formulate your thoughts about this and reach a decision that makes sense for your use case and your resources.

Boaz
