Some key points first:
- We do not advertise Elasticsearch as a primary data store, although many customers do use it as one.
- Elasticsearch is an eventually consistent, near-real-time storage engine.
- Elasticsearch is not ACID-compliant in the way a relational database is.
- Elastic Cloud provides snapshots every 30 minutes.
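As an illustration of the near-real-time point above: newly indexed documents only become searchable after a refresh, which by default happens on a one-second interval. A sketch of the standard settings and refresh endpoints follows; `my_index` is a placeholder name, and the 30s interval is just an example of trading search freshness for indexing throughput:

```
PUT /my_index/_settings
{
  "index": { "refresh_interval": "30s" }
}

POST /my_index/_refresh
```

The second call forces an immediate refresh, making all documents indexed so far visible to search.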
That being said, the current status of Elasticsearch resiliency is documented here:
More resiliency improvements are coming in the v5.x release, but many have already been realised in v2.x and backported to v1.x.
This decision comes down to a risk analysis: how much data could you lose, and could you replay or restore that data to a point in time if needed, or cope with some data being missing if you choose not to restore? It depends on your particular use case and tolerance for data loss. For some use cases any loss is critical; for others it may not matter much if some data is missing or slightly out of date.
With the ability to take incremental snapshots regularly, you should be able to restore to a fairly recent point in time to get data back. There are also many other techniques to limit loss or downtime, such as the number of nodes, the number of replicas, your disk RAID configuration, etc.
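As a sketch of that snapshot workflow using the Elasticsearch snapshot/restore API: `my_backup` and `snapshot_1` are placeholder names, and the `fs` repository type assumes a shared filesystem location that has been registered in `path.repo` on every node:

```
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": { "location": "/mount/backups/my_backup" }
}

PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true

POST /_snapshot/my_backup/snapshot_1/_restore
```

Because snapshots are incremental, taking them frequently is cheap after the first one; a restore brings the cluster back to the state captured by the chosen snapshot.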
The use of brokers/queues such as Kafka, RabbitMQ, or others can also help guard against data loss by keeping recent updates in a separate queue, giving you the ability to replay data from a point in time. Brokers also afford you other benefits, such as cluster maintenance without data loss: updates are queued during maintenance and flow into the cluster after maintenance is complete.
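The broker pattern above can be sketched in a few lines. This is a simplified in-memory stand-in, not a real Kafka/RabbitMQ integration: the `deque` plays the role of the broker topic, and `FakeCluster` is a hypothetical stand-in for an Elasticsearch client. All class and method names here are illustrative assumptions:

```python
from collections import deque


class BufferedIndexer:
    """Sketch of the broker/queue pattern: updates accumulate in a buffer
    (standing in for a Kafka/RabbitMQ topic) and are replayed into the
    cluster, in order, once it is available again."""

    def __init__(self, cluster):
        self.cluster = cluster      # anything with an index(doc) method
        self.buffer = deque()       # stand-in for the broker topic
        self.maintenance = False

    def submit(self, doc):
        # Always land the update in the buffer first, then drain if the
        # cluster is accepting writes; nothing is lost either way.
        self.buffer.append(doc)
        if not self.maintenance:
            self.drain()

    def drain(self):
        # Replay buffered updates into the cluster in arrival order.
        while self.buffer:
            self.cluster.index(self.buffer.popleft())

    def start_maintenance(self):
        self.maintenance = True

    def end_maintenance(self):
        self.maintenance = False
        self.drain()


class FakeCluster:
    # Hypothetical stand-in for an Elasticsearch client.
    def __init__(self):
        self.docs = []

    def index(self, doc):
        self.docs.append(doc)


cluster = FakeCluster()
indexer = BufferedIndexer(cluster)
indexer.submit({"id": 1})
indexer.start_maintenance()
indexer.submit({"id": 2})   # held in the buffer during maintenance, not lost
indexer.end_maintenance()   # buffered update flows into the cluster
print([d["id"] for d in cluster.docs])  # → [1, 2]
```

In a real deployment the buffer is durable (a Kafka topic with retention), so the same replay also covers point-in-time recovery after data loss, not just planned maintenance windows.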