OOM causes the whole cluster's data to be lost

ywtsang · June 28, 2012, 10:30am

Here is my setup:

Elasticsearch version 0.19.4
4 ES nodes on 4 machines
3 different indices:
index 1: 10 shards and 1 replica
indices 2 and 3: 5 shards and 1 replica each

The 4-node cluster is continuously indexing. I submitted an ordinary
search request with a terms facet, and it turned out that building the
terms facet triggered an OutOfMemoryError:

[2012-06-28 09:43:44,053][WARN ][index.cache.field.data.resident] [Kid Colt] [i3_product] loading field [deptIds] caused out of memory failure
java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.index.field.data.support.FieldDataLoader.load(FieldDataLoader.java:61)
    at org.elasticsearch.index.field.data.longs.LongFieldData.load(LongFieldData.java:166)

This search request is issued from multiple threads at the same time
(200 search requests are submitted to the cluster simultaneously), and
eventually all nodes were continuously logging the OOM error above.

I then restarted all 4 ES nodes one by one, and all the indices were
"lost": the whole cluster metadata seems to have been cleared, even
though the data files still exist in each ES node's data directory.

The following "dangling index" message appeared when I restarted the nodes:
dangling index, exists on local file system, but not in cluster metadata, scheduling to delete in [2h]

So my questions are:

Is the above behavior expected? How can I recover the cluster data?

I found this thread:
https://groups.google.com/d/msg/elasticsearch/sQCYHEdamJc/igf_DEICFmwJ
and it talks about "configure the VM to exit in case of OOM". How can
we configure the VM to exit in case of OOM?

Can ES do something to prevent this kind of OOM caused by a search
query? We may not be able to determine whether an incoming search
query will cause an OOM and thus bring disaster to the system.
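One partial, client-side mitigation for a burst like the 200 simultaneous facet searches described above is to cap how many such searches are in flight at once. This is only a sketch under assumptions: it uses no Elasticsearch API, the class name `FacetThrottle` and the limit of 8 are hypothetical, and `Thread.sleep` stands in for the real search call. It bounds concurrency only; it cannot stop a single facet request from loading field data that is too large for the heap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical client-side throttle: at most maxConcurrent searches run at once.
public class FacetThrottle {
    private final Semaphore permits;
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger maxObserved = new AtomicInteger();

    public FacetThrottle(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Blocks until a permit is free, runs the search, then releases the permit.
    public <T> T call(Callable<T> search) throws Exception {
        permits.acquire();
        try {
            int now = inFlight.incrementAndGet();
            maxObserved.accumulateAndGet(now, Math::max); // record peak concurrency
            return search.call();
        } finally {
            inFlight.decrementAndGet();
            permits.release();
        }
    }

    public int maxObserved() {
        return maxObserved.get();
    }

    // Fires `requests` searches from as many threads; returns the peak number in flight.
    public static int runBurst(int requests, int limit) throws Exception {
        FacetThrottle throttle = new FacetThrottle(limit);
        ExecutorService pool = Executors.newFixedThreadPool(requests);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                final int id = i;
                Callable<Integer> task = () -> throttle.call(() -> {
                    Thread.sleep(5); // stand-in for a real facet search
                    return id;
                });
                futures.add(pool.submit(task));
            }
            for (Future<Integer> f : futures) {
                f.get(); // propagate any failure
            }
        } finally {
            pool.shutdown();
        }
        return throttle.maxObserved();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("peak concurrent searches: " + runBurst(200, 8));
    }
}
```

The semaphore guarantees the reported peak never exceeds the limit, however many threads submit work; whether that avoids the OOM depends on whether the pressure really comes from concurrency rather than from one field's data being too large.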

Thanks,
Wing

otisg · June 28, 2012, 2:06pm

Wing,

Look for "Elasticsearch Cache Usage" under
Elasticsearch Consulting - Sematext - it may help.
This may help as well:

-XX:+HeapDumpOnOutOfMemoryError
-XX:ErrorFile=/usr/share/es/hs_err_pid.log
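A minimal, HotSpot-only sketch for checking that options like these actually reached the running JVM, read through `com.sun.management.HotSpotDiagnosticMXBean`. The flag list is an assumption: besides the two options above it also asks about `HeapDumpPath`, `OnOutOfMemoryError` (runs a command on the first OOM) and `ExitOnOutOfMemoryError` (only present on JDK 8u92 and later), since the last two relate to the "exit on OOM" question. The class name `OomFlagCheck` is hypothetical; run it with the same JVM options you pass to the ES process.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

// Hypothetical helper: prints OOM-related HotSpot flags of the running JVM.
public class OomFlagCheck {
    // The two options above plus related ones; ExitOnOutOfMemoryError needs JDK 8u92+.
    public static final String[] FLAGS = {
        "HeapDumpOnOutOfMemoryError", "HeapDumpPath", "ErrorFile",
        "OnOutOfMemoryError", "ExitOnOutOfMemoryError"
    };

    public static String describe(String name) {
        try {
            // HotSpot-specific bean; may be null on JVMs that do not provide it.
            HotSpotDiagnosticMXBean diag =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (diag == null) {
                return name + " = <HotSpotDiagnosticMXBean unavailable>";
            }
            VMOption option = diag.getVMOption(name);
            // Origin VM_CREATION means the value was set when the JVM was created,
            // for example on the command line.
            return name + " = '" + option.getValue() + "' (origin: " + option.getOrigin() + ")";
        } catch (IllegalArgumentException e) {
            // Thrown when this JVM has no option with that name.
            return name + " = <not known to this JVM>";
        }
    }

    public static void main(String[] args) {
        for (String flag : FLAGS) {
            System.out.println(describe(flag));
        }
    }
}
```

For example, `java -XX:+HeapDumpOnOutOfMemoryError OomFlagCheck` should report `true` for the first flag; an empty or default value means the option never reached that JVM.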

Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service


ywtsang · July 3, 2012, 4:25am

Thanks for the information.

I read this slide deck:

I think slide 27 addresses OOMs caused by "too many facets", and it
recommends setting index.cache.type to soft.

So I looked up what exactly "soft" means at:

The docs only mention that index.cache.type can be resident, soft, or
weak, with no further explanation. Maybe this is obvious to others?

Could someone briefly explain the differences between these three
cache types: resident, soft, and weak?

Thanks,
Wing


jprante · July 3, 2012, 10:10pm

The docs are a little short on that topic. In fact, index.cache.type
specifies which type of Java object reference the cache uses to hold
its entries.

For how "resident" (strong), soft, and weak references differ with
respect to garbage collection, please see this nice blog entry:

http://weblogs.java.net/blog/enicholas/archive/2006/05/understanding_w.html
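To make the distinction concrete: a strongly referenced object is never reclaimed while it is reachable; a softly referenced one may be cleared when the JVM needs memory, and the `SoftReference` javadoc guarantees that all softly reachable objects are cleared before an `OutOfMemoryError` is thrown; a weakly referenced one can be cleared as soon as the collector finds nothing stronger pointing to it. The toy cache below (hypothetical names, not Elasticsearch's implementation) wraps each loaded value in one of the three reference types, mirroring the resident/soft/weak choice, and reloads a value only when the GC has cleared it.

```java
import java.lang.ref.Reference;
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class RefCacheDemo {
    // Hypothetical names mirroring the three index.cache.type values.
    public enum Mode { RESIDENT, SOFT, WEAK }

    // Toy per-field cache; not Elasticsearch's actual class.
    static final class FieldCache {
        private final Mode mode;
        private final Function<String, long[]> loader;
        private final Map<String, Object> entries = new HashMap<>();
        private int loads = 0;

        FieldCache(Mode mode, Function<String, long[]> loader) {
            this.mode = mode;
            this.loader = loader;
        }

        long[] get(String field) {
            long[] values = unwrap(entries.get(field));
            if (values == null) { // never loaded, or cleared by the GC
                values = loader.apply(field);
                loads++;
                entries.put(field, wrap(values));
            }
            return values;
        }

        int loads() {
            return loads;
        }

        private Object wrap(long[] values) {
            switch (mode) {
                case SOFT:
                    return new SoftReference<>(values); // may be cleared under memory pressure
                case WEAK:
                    return new WeakReference<>(values); // may be cleared once only weakly reachable
                default:
                    return values; // RESIDENT: strong reference, never cleared by the GC
            }
        }

        private static long[] unwrap(Object holder) {
            if (holder instanceof Reference) {
                return (long[]) ((Reference<?>) holder).get(); // null if already cleared
            }
            return (long[]) holder; // null or a strongly held array
        }
    }

    // Loads a field, then reads it again while the caller still holds it strongly.
    public static int loadsWhilePinned(Mode mode) {
        FieldCache cache = new FieldCache(mode, field -> new long[1_000]);
        long[] pinned = cache.get("deptIds");
        long[] again = cache.get("deptIds");
        if (pinned != again) {
            throw new IllegalStateException("entry was reloaded while pinned");
        }
        return cache.loads();
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": loads while pinned = " + loadsWhilePinned(mode));
        }
    }
}
```

Running `main` prints a load count of 1 for every mode, because a value the caller still holds strongly cannot be cleared whatever wrapper the cache uses. What happens after the caller lets go depends on the GC and is deliberately not asserted; with SOFT or WEAK, the price of letting the GC reclaim entries instead of throwing an OOM is that values may have to be reloaded later.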

Best regards,

Jörg
