OOM causes the whole cluster's data to be lost

Here is my setup:

elasticsearch version 0.19.4
4 ES nodes on 4 machines
hosting 3 different indices:
index 1: 10 shards and 1 replica
indices 2 and 3: 5 shards and 1 replica

The 4-node cluster is continuously indexing, and I submitted a normal
search request with a terms facet. It turns out that building the
terms facet triggered an OutOfMemory error:

[2012-06-28 09:43:44,053][WARN ][index.cache.field.data.resident] [Kid Colt] [i3_product] loading field [deptIds] caused out of memory failure
java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.index.field.data.support.FieldDataLoader.load(FieldDataLoader.java:61)
    at org.elasticsearch.index.field.data.longs.LongFieldData.load(LongFieldData.java:166)
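For reference, the request is an ordinary terms facet on the deptIds
field, roughly this JSON body POSTed to /i3_product/_search (the size
value is a placeholder; building the facet loads the whole deptIds
field data into the heap on every node that holds a shard):

```
{
  "query": { "match_all": {} },
  "facets": {
    "deptIds": { "terms": { "field": "deptIds", "size": 10 } }
  }
}
```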

This search request happens in multiple threads at the same time,
i.e. 200 search requests are submitted to the cluster simultaneously,
and eventually all nodes continuously show the above OOM errors.

Then I restarted all 4 ES nodes one by one, and all the indices were
"lost": the whole cluster metadata seems to have been cleared out,
though the data files still exist in each node's data directory.

A "dangling index" message appeared when I restarted the nodes:
dangling index, exists on local file system, but not in cluster
metadata, scheduling to delete in [2h]

So my questions are:

  1. Is the above behavior expected? How can I recover the cluster data?

  2. I found this thread:
    https://groups.google.com/d/msg/elasticsearch/sQCYHEdamJc/igf_DEICFmwJ
    It talks about "configure the VM to exit in case of OOM". How can
    we configure the VM to exit in case of OOM?

  3. Can ES do something to prevent this kind of OOM caused by a search
    query? We may not be able to determine whether an incoming search
    query will cause OOM and thus bring disaster to the system.

Thanks,
Wing

Wing,

Look for "Elasticsearch Cache Usage" under
Elasticsearch Consulting - Sematext - it may help.
This may help as well:

-XX:+HeapDumpOnOutOfMemoryError
-XX:ErrorFile=/usr/share/es/hs_err_pid.log
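For the "exit in case of OOM" part of question 2 specifically, one
common approach (my own sketch, not something the ES docs prescribe;
paths are placeholders) is HotSpot's OnOutOfMemoryError hook, which
runs an arbitrary command when the JVM throws OutOfMemoryError:

```shell
# Kill the node outright on OOM so the cluster fails over cleanly
# instead of limping along with a half-broken heap. %p expands to
# the JVM's pid. Also dump the heap first for post-mortem analysis.
export ES_JAVA_OPTS="-XX:+HeapDumpOnOutOfMemoryError \
  -XX:HeapDumpPath=/usr/share/es/heapdumps \
  -XX:OnOutOfMemoryError=\"kill -9 %p\""
```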

Otis

Search Analytics - Cloud Monitoring Tools & Services | Sematext
Scalable Performance Monitoring - Sematext Monitoring | Infrastructure Monitoring Service

On Thursday, June 28, 2012 6:30:48 AM UTC-4, Yiu Wing TSANG wrote:


Thanks for your information.

I read this slide:

and I think slide 27 can solve the OOM caused by "too many facets";
it recommends setting index.cache.type to soft.
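For concreteness, that change would go in elasticsearch.yml. Judging
from the logger name index.cache.field.data.resident in the error
above, the field data cache setting may actually be spelled
index.cache.field.type — a sketch I have not verified against the
0.19 docs:

```
index.cache.field.type: soft
```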

So I tried to check what exactly "soft" means at:

The doc just mentions that index.cache.type can be resident, soft, or
weak, but gives no further explanation, which perhaps seems trivial
to others.

Could you give a brief description of the differences between these
3 cache types: resident, soft, and weak?

Thanks,
Wing

On Thu, Jun 28, 2012 at 10:06 PM, Otis Gospodnetic
otis.gospodnetic@gmail.com wrote:


The docs are a little short on that topic. In fact, index.cache.type
specifies the type of Java object reference the cache stores its
entries with.

For the different garbage-collection characteristics of "resident"
(strong), soft, and weak references, please see this nice blog entry:

http://weblogs.java.net/blog/enicholas/archive/2006/05/understanding_w.html
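As a small stand-alone illustration (plain Java, nothing
Elasticsearch-specific; note that clearing behavior after System.gc()
is typical HotSpot behavior, not guaranteed by the spec):

```java
import java.lang.ref.SoftReference;
import java.lang.ref.WeakReference;

public class RefDemo {
    // Fields, so the reference state can be inspected after a GC.
    static WeakReference<byte[]> weak;
    static SoftReference<byte[]> soft;

    public static void main(String[] args) {
        // A strong reference ("resident") is never cleared by the GC;
        // a resident cache therefore grows until the heap is exhausted.
        byte[] strong = new byte[64 * 1024];

        // A weak reference is typically cleared by the next GC cycle
        // once no strong reference to the object remains.
        weak = new WeakReference<>(new byte[64 * 1024]);

        // A soft reference survives ordinary GC and is only cleared
        // when the JVM is under memory pressure - a compromise suited
        // to caches: entries vanish instead of causing an OOM.
        soft = new SoftReference<>(new byte[64 * 1024]);

        System.gc(); // a hint; HotSpot usually runs a full GC here

        System.out.println("strong alive: " + (strong.length > 0));
        System.out.println("weak cleared: " + (weak.get() == null));
        System.out.println("soft alive:   " + (soft.get() != null));
    }
}
```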

Best regards,

Jörg