ElasticSearch Recovery after data loss due to an Out of memory error

Hi,

We are running ElasticSearch 0.16.4 in a 2-node setup (shards=3,
replication=1). Last night one of the nodes displayed an "Out of
memory" error in its log.
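For reference, that setup corresponds to an elasticsearch.yml along
these lines (the cluster name below is a placeholder; only the
shard/replica counts and the gateway type reflect our actual setup):

```yaml
# elasticsearch.yml (0.16.x) -- sketch of the setup described above;
# the cluster name is a placeholder
cluster.name: my-cluster
index.number_of_shards: 3
index.number_of_replicas: 1
gateway.type: local
```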

This morning it became apparent that something was terribly wrong.

Shortly after the Out of memory error, ElasticSearch wiped out about
2/3 of our index.

We tried a rolling restart, a complete cluster restart, a flush, and a
refresh, but nothing helped. The cluster comes back up with everything
showing as healthy, but 2/3 of the data is GONE.
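For reference, the flush and refresh we tried were the standard REST
endpoints (this assumes a node listening on localhost:9200):

```shell
# Check cluster state, then force a flush and a refresh across all indices.
# Assumes a node listening on localhost:9200.
curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
curl -XPOST 'http://localhost:9200/_flush'
curl -XPOST 'http://localhost:9200/_refresh'
```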

Is this a known bug?

To recover, we rsync'ed an old index backup and manually replayed all
updates from the last 24 hours.
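Roughly, the manual recovery looked like the following (host names,
paths, and the service wrapper are placeholders, not our actual setup):

```shell
# 1. Stop the node so index files are not modified while copying.
sudo service elasticsearch stop
# 2. Restore the last known-good index files into the data directory
#    (backup host and paths are placeholders).
rsync -a --delete backup-host:/backups/elasticsearch/data/ /var/lib/elasticsearch/data/
# 3. Bring the node back up and let the cluster recover its shards.
sudo service elasticsearch start
# 4. Replay the last 24 hours of updates from the application's own records.
```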

Is there a more natural way to recover? Right now we are using the
local gateway. Would using s3 have helped us in such a situation?
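(For context, by "s3" I mean the S3 gateway, configured along these
lines on 0.16-era versions. The bucket name and credentials below are
placeholders, and the exact setting names should be checked against the
docs for your version:)

```yaml
# elasticsearch.yml -- S3 gateway sketch (placeholders throughout)
gateway.type: s3
gateway.s3.bucket: my-es-gateway-bucket
cloud.aws.access_key: YOUR_ACCESS_KEY
cloud.aws.secret_key: YOUR_SECRET_KEY
```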

In our experience, ElasticSearch's built-in recovery does the trick in
most cases, but we have run into a handful of cases like this recent
one, where something becomes corrupted beyond ElasticSearch's ability
to recover automatically.

We're looking to establish a best-practices procedure on how to
recover in these cases.

Would appreciate any feedback from the community, as well as thoughts
on whether gateway-s3 helps in providing better recovery.

Our index size is 100GB and it changes frequently.

Thanks,
Mike Peters

The problem with data possibly being lost on an OOM has been fixed in 0.17.
The local gateway's aim is to work its way up to a state where you never
lose data, and it provides a substantial benefit over a shared gateway
because it is considerably more lightweight.

On Wed, Oct 12, 2011 at 7:59 PM, Mike Peters mike@softwareprojects.com wrote:


On Oct 14, 2011 6:10 PM, "Shay Banon" kimchy@gmail.com wrote:


While stress-testing 0.17.6 I ran into some data corruption like this;
the issue was fixed in 0.17.7.