I am using an ES cluster with thousands of documents spread across several
types in a single index. Rather small compared to the size of most ES
instances I see on the list.
I am deploying to EC2 using a local index and the S3 gateway. ES is the only
data store, so if I had to reindex my data because of a mapping change or
corruption of the S3 gateway, I would have no way to recover my original
documents.
I have a long term solution to persist data as it is written to ES to
another data store for safekeeping. In the meantime, I have a job which
performs a scan search of all records in my ES index and writes them to S3.
It writes about 5,000 records in about 30 seconds, and most of that time is
spent writing the records one-by-one to S3 over HTTP. Not very efficient,
but it is working for now.
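For what it's worth, the paging logic of that job looks roughly like the
sketch below. This is a minimal scroll loop with the actual HTTP calls
stubbed out: `open_scan` and `fetch_page` are illustrative callables standing
in for the real `_search?search_type=scan` and `_search/scroll` requests to
the cluster, so only the draining logic is shown.

```python
def scan_all(open_scan, fetch_page):
    """Drain a scan search.

    open_scan()           -> initial scroll id
    fetch_page(scroll_id) -> (next_scroll_id, list_of_hits)

    Yields every hit; stops when a page comes back empty, which is how
    the scroll API signals the end of the result set.
    """
    scroll_id = open_scan()
    while True:
        scroll_id, hits = fetch_page(scroll_id)
        if not hits:
            break
        for hit in hits:
            yield hit

# Stubbed pages standing in for responses from the live cluster; each
# entry maps a scroll id to (next scroll id, hits on that page).
pages = {
    "a": ("b", ["doc1", "doc2"]),
    "b": ("c", ["doc3"]),
    "c": ("c", []),  # empty page ends the scroll
}
docs = list(scan_all(lambda: "a", lambda sid: pages[sid]))
```

In the real job, each yielded hit is then written to S3 one request at a
time, which is where most of the 30 seconds goes.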
I cannot shut down the ES server for 30 seconds while I write these records,
so I had a couple of questions:
- When scan executes, does it cache all of the ids of the documents
which match the query?
- As I fetch documents, does scan return the version of the document
which existed at the time of the initial scan, or at the time of the
subsequent scrollId request?
Neither S3 nor SimpleDB seems to have a snapshot capability. Does anyone else
have any thoughts on how to back up ES? I imagine that most people are using
ES as a secondary store for search purposes only, but I think more and more
people are wanting to ditch their primary storage in favor of ES.
Thanks.