ElasticSearch Recovery after data loss due to an Out of memory error

Hi,

We are running ElasticSearch 0.16.4 in a 2-node setup (shards=3,
replication=1). Last night one of the nodes displayed an "Out of
memory" error in its log.
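For reference, that setup corresponds to an elasticsearch.yml along
these lines (the cluster name below is a placeholder; only the
shard/replica counts and the gateway type reflect our actual setup):

```yaml
# elasticsearch.yml (0.16.x) -- sketch of the setup described above;
# the cluster name is a placeholder
cluster.name: my-cluster
index.number_of_shards: 3
index.number_of_replicas: 1
gateway.type: local
```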

This morning it became apparent that something was terribly wrong.

Shortly after the Out of memory error, ElasticSearch wiped out about
2/3 of our index.

We tried a rolling restart, a complete cluster restart, a flush, and a
refresh, but nothing helped. The cluster comes back up with everything
showing as healthy, but 2/3 of the data is GONE.
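For reference, the flush and refresh we tried were the standard REST
endpoints (this assumes a node listening on localhost:9200):

```shell
# Check cluster state, then force a flush and a refresh across all indices.
# Assumes a node listening on localhost:9200.
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
curl -XPOST 'http://localhost:9200/_flush'
curl -XPOST 'http://localhost:9200/_refresh'
```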

Is this a known bug?

To recover, we rsync'ed an old index backup and manually replayed all
updates from the last 24 hours.
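Roughly, the manual recovery looked like the following (host names,
paths, and the service wrapper are placeholders, not our actual setup):

```shell
# 1. Stop the node so index files are not modified while copying.
sudo service elasticsearch stop
# 2. Restore the last known-good index files into the data directory
#    (backup host and paths are placeholders).
rsync -a --delete backup-host:/backups/elasticsearch/data/ /var/lib/elasticsearch/data/
# 3. Bring the node back up and let the cluster recover its shards.
sudo service elasticsearch start
# 4. Replay the last 24 hours of updates from the application's own records.
```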

Is there a more natural way to recover? Right now we are using the
local gateway. Would using s3 have helped us in such a situation?
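(For context, by "s3" I mean the S3 gateway, configured along these
lines on 0.16-era versions. The bucket name and credentials below are
placeholders, and the exact setting names should be checked against the
docs for your version:)

```yaml
# elasticsearch.yml -- S3 gateway sketch (placeholders throughout)
gateway.type: s3
gateway.s3.bucket: my-es-gateway-bucket
cloud.aws.access_key: YOUR_ACCESS_KEY
cloud.aws.secret_key: YOUR_SECRET_KEY
```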

In our experience, ElasticSearch's built-in recovery does the trick in
most cases, but we have run into a handful of cases like this recent
one, where something becomes corrupted beyond ElasticSearch's ability
to recover automatically.

We're looking to establish a best-practices procedure on how to
recover in these cases.

Would appreciate any feedback from the community, as well as thoughts
on whether gateway-s3 helps in providing better recovery.

Our index size is 100GB and it changes frequently.

Thanks,
Mike Peters

The problem with data possibly being lost on an OOM has been fixed in 0.17.
The local gateway's aim is to work its way up to a state where you never
lose data, and it provides a substantial benefit over a shared gateway
because it is considerably more lightweight.

On Wed, Oct 12, 2011 at 7:59 PM, Mike Peters mike@softwareprojects.com wrote:


On Oct 14, 2011 6:10 PM, "Shay Banon" kimchy@gmail.com wrote:


While stress-testing 0.17.6 I ran into some data corruption like this;
the issue was fixed in 0.17.7.