I have a system containing billions of documents. I need a consistent point-in-time copy of a subset of fields from one year's worth of data (e.g. 200M documents). The documents are subject to change. Only a small fraction of them changes, but that is enough that I need an assured consistent, complete copy of the data "frozen" for future reference. The copy will be used by another system and kept as evidence of why that system produced its output.
The options I can think of are:
1. Extract the data from Elastic and write to another permanent store that will use this data.
2. Snapshot the data within Elastic.
For (1), I guess something like point-in-time is the way to ensure consistency? We are running 6.2.3, so we don't have access to features like this, and I suppose we would have to risk inconsistency.
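To make option (1) concrete, here is a rough sketch of the extraction using the scroll API, which is available in 6.x. The function name, index name, date field and field list are hypothetical, and `client` is assumed to behave like the elasticsearch-py client (`search`, `scroll`, `clear_scroll`). My understanding is that a scroll reads from the search context opened by the initial request, but I have not verified whether that gives the consistency I need here.

```python
def extract_fields(client, index, fields, date_field, start, end,
                   page_size=1000, keep_alive="5m"):
    """Yield (_id, _source) pairs for documents in [start, end).

    `client` is assumed to expose search/scroll/clear_scroll like the
    elasticsearch-py client; all other names are illustrative only.
    """
    body = {
        "_source": fields,   # return only the subset of fields we need
        "sort": ["_doc"],    # index order, the cheapest order for scrolling
        "query": {"range": {date_field: {"gte": start, "lt": end}}},
    }
    resp = client.search(index=index, body=body,
                         scroll=keep_alive, size=page_size)
    scroll_id = resp.get("_scroll_id")
    try:
        # Keep pulling pages until an empty page signals the end.
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                yield hit["_id"], hit["_source"]
            resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = resp.get("_scroll_id", scroll_id)
    finally:
        # Release the search context on the cluster when done.
        if scroll_id:
            client.clear_scroll(scroll_id=scroll_id)
```

Each yielded pair would then be written to the permanent store. With 200M documents this would presumably need batching on the write side, which the sketch leaves out.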
For (2), we would still need to extract the data to a permanent store, as the Elastic store is not truly permanent for us. However, I'm not sure I correctly understand the snapshot process in terms of data consistency.
What would be the best way to achieve this?