Lost data due to out of memory error

I encountered lost data as a result of a java.lang.OutOfMemoryError
condition. I am using:

  • 0.17.4
  • 4g min/max heap size
  • mlockall
  • single node, 1 shard per index, 0 replicas
  • indexed 300m documents across 6 indexes

config:

index:
  number_of_shards: 1
  number_of_replicas: 0
  refresh_interval: 20s

bootstrap:
  mlockall: true

I had been indexing about 3,000 messages per second continuously for
over 80 hours. At around the 300m-document mark, I ran this query and
it hung waiting for a response:

curl -XGET http://localhost:9200/_search -d '{"query":
{"query_string":{"query":"someword"}}}'

The term "someword" is contained in all 300m documents. Prior to this,
other queries had returned results fine. Now queries to the health
endpoint, and any other query, do not return a result (they just
hang). The only evidence of a problem, apart from the hanging queries,
was this in the log file:

failed engine java.lang.OutOfMemoryError: Java heap space

A normal "kill PID" command did not initiate a graceful shudown and I
had to perform a "kill -9".
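Next time I will start the JVM with the standard HotSpot heap-dump-on-OOM
flags so there is something to inspect afterwards; a minimal sketch,
assuming bin/elasticsearch honors ES_JAVA_OPTS:

# generic HotSpot options, not ES-specific: write a heap dump when an OOM is thrown
export ES_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/elasticsearch"
./bin/elasticsearch -f    # -f keeps the process in the foreground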

"lsof" showed 26k files open (vs. 1300 without ES running), and ulimit
is set at 100000.
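For reference, this is roughly how I counted the descriptors, assuming
the java process pgrep finds is the ES one:

ES_PID=$(pgrep -f elasticsearch)          # assumes a single ES java process on the box
ls /proc/$ES_PID/fd | wc -l               # descriptors currently held by the process
grep "open files" /proc/$ES_PID/limits    # limit actually applied, may differ from the shell's ulimit -n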

Upon restart, with the min/max heap size raised to 8g, all of the files
in the ./index and ./translog directories were deleted. The cluster
won't start up and emits a continuous stream of errors like this:

[2011-08-11 01:07:58,378][WARN ][indices.cluster ] [Unuscione, Angelo] [index0006][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [index0006][0] shard allocated for local recovery (post api), should exists, but doesn't

Is there a way to predict when we might encounter out-of-memory
errors? I can enforce better queries in the user interface if there
are certain queries that we should not be executing.
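For example, I could poll the JVM heap numbers from the node stats API
and warn (or block expensive queries) once usage crosses a threshold; a
rough sketch, assuming the 0.17.x stats endpoint lives under
_cluster/nodes/stats and that jvm stats need to be requested explicitly
(the path and flags have moved around between versions):

# crude heap check: pull JVM stats for all nodes and compare used vs. max heap
curl -s 'http://localhost:9200/_cluster/nodes/stats?jvm=true&pretty=true' | grep -E 'heap_used|heap_max'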

Is there a better way to start up ES if I know I previously had an
unclean shutdown?

I am expecting each node to have at least 10 billion documents across
100 to 300 indexes eventually. The document sizes are small, ranging
from 200 to 800 bytes, plus an additional 200 bytes for the fields.
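Back of the envelope, assuming roughly 400 to 1,000 bytes per document
including fields, 10 billion documents works out to somewhere around
4 to 10 TB of raw data per node before any index overhead, which is why
I want to understand the memory behavior now.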

I think you might have hit this one: "Failed shard recovery can cause
shard data to be deleted (replicas will still work)" (elastic/elasticsearch
issue #1227 on GitHub), which I fixed yesterday. Did shards fail before
you shut down the node?


The logs don't show anything before the OOM error; upon restart there
was a "failed to start shard".

That issue looks very similar, so it's likely the cause. I will test
some OOM conditions with the new version and gist any problems. The 3
main scenarios I want to ensure reliable recovery from are (a rough
sketch of how I plan to force them follows below):

  • OOM
  • too many open files
  • file system full

In theory, proper health monitoring would mean we don't encounter these
scenarios. I know you mentioned previously that both Lucene and ES go
to great lengths to recover from various failure states.
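Here is a rough sketch of how I plan to force each of the three
failures; paths, sizes, and limits are just placeholders:

# OOM: re-run the heavy query with a deliberately small heap (ES_MIN_MEM/ES_MAX_MEM set low)

# "too many open files": start ES in the foreground with a deliberately low fd limit
(ulimit -n 256 && ./bin/elasticsearch -f)

# "file system full": put path.data on a small loopback filesystem and let indexing fill it
dd if=/dev/zero of=/tmp/es-small.img bs=1M count=512
mkfs.ext4 -F /tmp/es-small.img
sudo mount -o loop /tmp/es-small.img /mnt/es-data    # then point path.data at /mnt/es-data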


Yes, even with a no-replicas index (single node or several), you should
not lose data in those failure cases.


Is this problem fixed in the 0.17.5 release? If so, I'll upgrade to it
ASAP :wink:

Thanks for the hard work, Shay.

-- Ben.


Yes, in 0.17.5 there is a fix for a case where data could be lost (in
the single node / no replicas case). I've run several long-running,
failure-driven tests simulating those failures and could not get to a
state where data was lost.
