Recovery

James_Cook · July 12, 2010, 7:10pm

Assume the complete failure of all my instances of ES on EC2 using the
new cloud plugin for discovery and gateway.

My management console will attempt to bring up instances to replace
the ones which failed.

Can you describe the restoration process that begins as that first
node is restarted and subsequent nodes come back online?

Is there any manual intervention required by the process?

If the only data that remains is in S3 (each instance of ES failed
and took its memory and hard drive storage with it), does this hurt
recovery?

Will the system appear offline until recovery is completed?

Thanks for making search fun!

kimchy · July 12, 2010, 7:41pm

If all the instances fail and you start new nodes, the data will be
recovered from S3. While data is being recovered (note that this might take
time, depending on S3 and the machine's IO performance), the cluster will not
be fully available. Once a specific shard has recovered, it becomes available,
so you will get partial search results (the search response reports how many
shards succeeded and how many failed).
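To make the partial-results point concrete, here is a minimal Python sketch of how a client might check whether a search only covered some of the shards. It assumes the response carries a `_shards` object with `total`, `successful` and `failed` counts; the `shard_summary` helper and the sample response are hypothetical and only for illustration.

```python
def shard_summary(response):
    """Return (successful, failed, total, is_partial) for a search response dict."""
    # Assumption: shard counts live under "_shards" in the response body.
    shards = response.get("_shards", {})
    total = shards.get("total", 0)
    successful = shards.get("successful", 0)
    failed = shards.get("failed", 0)
    # Treat the result as partial if not every shard answered successfully.
    is_partial = successful < total
    return successful, failed, total, is_partial


# Hypothetical response while 2 of 5 shards are still being recovered.
sample = {
    "_shards": {"total": 5, "successful": 3, "failed": 2},
    "hits": {"total": 42, "hits": []},
}

successful, failed, total, partial = shard_summary(sample)
print(f"{successful}/{total} shards answered, {failed} failed, partial={partial}")
```

A client could use the `is_partial` flag to show a "results may be incomplete" notice until recovery has finished.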

Note that in 0.9, hopefully, recovery from S3 will be a bit faster (I am
working on a native S3 implementation, which should be faster and less of a
resource hog than the current ones out there).

A quick note regarding 0.9 and the upcoming "reuse work dir" feature: if the
machines fail but the file system is still there (assuming you store the index
on the file system), then data that already exists on the local file system
will not be recovered from S3, resulting in faster recovery. For this
feature, you would probably want to set the gateway.recover_after_nodes
setting, so that the recovery process only starts after N nodes have started
(only relevant after a full cluster shutdown / failure).
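As a rough illustration of what gateway.recover_after_nodes is described as doing, the toy Python model below holds recovery back until N nodes have started. It is not Elasticsearch code: `recovery_may_start`, the node names and the threshold of 3 are all made up for the example, and the real setting belongs in the node configuration.

```python
def recovery_may_start(started_nodes, recover_after_nodes):
    """Toy gate: allow recovery only once enough nodes have started."""
    return len(started_nodes) >= recover_after_nodes


# Hypothetical restart sequence after a full cluster failure, with N = 3.
started = []
for name in ["node-1", "node-2", "node-3"]:
    started.append(name)
    state = "start recovery" if recovery_may_start(started, 3) else "wait"
    print(f"{len(started)} node(s) up: {state}")
```

Waiting like this gives nodes that still have index files on disk a chance to rejoin before recovery decides what has to be pulled from S3.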

-shay.banon
