Yet another OOME: Java heap space thread :S

Hi everyone,

First off, apologies for the thread. I know OOME discussions are somewhat
overdone in the group, but I need to reach out for some help for this one.

I have a 2-node development cluster in EC2 on c3.4xlarge instances. That means
16 vCPUs, 30GB RAM, a 1Gb network, and two 500GB EBS volumes for
Elasticsearch data on each instance.

I'm running Java 1.7.0_55, and using the G1 collector. The Java args are:

/usr/bin/java -Xms8g -Xmx8g -Xss256k -Djava.awt.headless=true -server
-XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20
-XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError

The index has 2 shards, each with 1 replica.

I have a daily index being filled with application log data. The index, on
average, gets to be about:
486M documents
53.1GB (primary size)
106.2GB (total size)

Other than indexing, there really is nothing going on in the cluster. No
searches or percolators, just collecting data.

I have:

  • Tweaked the index.merge.policy
  • Tweaked the indices.fielddata.breaker.limit and cache.size
  • Changed the index refresh_interval from 1s to 60s
  • Created a default template for the index such that _all is disabled,
    and all fields in the mapping are set to "not_analyzed" (a rough sketch
    of such a template follows this list).
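
For illustration only, a minimal template along those lines might look like the
following (the template name and index pattern are placeholders, not my exact
template; ES 1.x syntax):

# Hypothetical example: disable _all and map all string fields as not_analyzed
# for daily indices matching the pattern below.
curl -XPUT 'http://localhost:9200/_template/logs_defaults' -d '
{
  "template": "derbysoft-*",
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "dynamic_templates": [
        {
          "strings_not_analyzed": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": { "type": "string", "index": "not_analyzed" }
          }
        }
      ]
    }
  }
}'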

Here is my complete elasticsearch.yml:

action:
  disable_delete_all_indices: true
cluster:
  name: elasticsearch-dev
discovery:
  zen:
    minimum_master_nodes: 2
    ping:
      multicast:
        enabled: false
      unicast:
        hosts: 10.0.0.45,10.0.0.41
gateway:
  recover_after_nodes: 2
index:
  merge:
    policy:
      max_merge_at_once: 5
      max_merged_segment: 15gb
  number_of_replicas: 1
  number_of_shards: 2
  refresh_interval: 60s
indices:
  fielddata:
    breaker:
      limit: 50%
    cache:
      size: 30%
node:
  name: elasticsearch-ip-10-0-0-45
path:
  data:
    - /usr/local/ebs01/elasticsearch
    - /usr/local/ebs02/elasticsearch
threadpool:
  bulk:
    queue_size: 500
    size: 75
    type: fixed
  get:
    queue_size: 200
    size: 100
    type: fixed
  index:
    queue_size: 1000
    size: 100
    type: fixed
  search:
    queue_size: 200
    size: 100
    type: fixed

The heap sits at about 13GB used. I had been battling OOME exceptions for
a while, and thought I had it licked, but one just popped up again. My
cluster had been up and running fine for 14 days, and I just got this OOME:

=====
[2014-07-30 11:52:28,394][INFO ][monitor.jvm ] [elasticsearch-ip-10-0-0-41] [gc][young][1158834][109906] duration [770ms], collections [1]/[1s], total [770ms]/[43.2m], memory [13.4gb]->[13.4gb]/[16gb], all_pools {[young] [648mb]->[8mb]/[0b]}{[survivor] [0b]->[0b]/[0b]}{[old] [12.8gb]->[13.4gb]/[16gb]}
[2014-07-30 15:03:01,070][WARN ][index.engine.internal ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed engine [out of memory]
[2014-07-30 15:03:10,324][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:10,335][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:10,324][WARN ][index.merge.scheduler ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed to merge
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:03:28,595][WARN ][index.translog ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] failed to flush shard on translog threshold
org.elasticsearch.index.engine.FlushFailedEngineException: [derbysoft-20140730][0] Flush failed
at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:805)
at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:604)
at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:202)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
at org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4416)
at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2989)
at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3096)
at org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3063)
at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:797)
... 5 more
[2014-07-30 15:03:28,658][WARN ][cluster.action.shard ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][0] sending failed shard for [derbysoft-20140730][0], node[W-7FsjjZTyOXZdaJhhqxEA], [R], s[STARTED], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [engine failure, message [out of memory][IllegalStateException[this writer hit an OutOfMemoryError; cannot commit]]]
[2014-07-30 15:34:36,418][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:39,847][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:42,873][WARN ][index.merge.scheduler ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed to merge
java.lang.OutOfMemoryError: Java heap space
[2014-07-30 15:34:42,873][WARN ][index.engine.internal ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed engine [merge exception]
[2014-07-30 15:34:43,185][WARN ][cluster.action.shard ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] sending failed shard for [derbysoft-20140730][1], node[W-7FsjjZTyOXZdaJhhqxEA], [P], s[STARTED], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [engine failure, message [merge exception][MergeException[java.lang.OutOfMemoryError: Java heap space]; nested: OutOfMemoryError[Java heap space]; ]]
[2014-07-30 15:57:42,531][WARN ][indices.recovery ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] recovery from [[elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]]] failed
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [derbysoft-20140730][1] Phase[2] Execution failed
at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1011)
at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:631)
at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:122)
at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:62)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:351)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:369)
... 3 more
[2014-07-30 15:57:42,534][WARN ][indices.cluster ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] failed to start shard
org.elasticsearch.indices.recovery.RecoveryFailedException: [derbysoft-20140730][1]: Recovery failed from [elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]] into [elasticsearch-ip-10-0-0-41][W-7FsjjZTyOXZdaJhhqxEA][ip-10-0-0-41.us-west-2.compute.internal][inet[ip-10-0-0-41.us-west-2.compute.internal/10.0.0.41:9300]]
at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:306)
at org.elasticsearch.indices.recovery.RecoveryTarget.access$300(RecoveryTarget.java:65)
at org.elasticsearch.indices.recovery.RecoveryTarget$3.run(RecoveryTarget.java:184)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.RemoteTransportException: [elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [derbysoft-20140730][1] Phase[2] Execution failed
at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1011)
at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:631)
at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:122)
at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:62)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:351)
at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:369)
... 3 more
[2014-07-30 15:57:42,535][WARN ][cluster.action.shard ] [elasticsearch-ip-10-0-0-41] [derbysoft-20140730][1] sending failed shard for [derbysoft-20140730][1], node[W-7FsjjZTyOXZdaJhhqxEA], [R], s[INITIALIZING], indexUUID [QC5Sg0FDSnOGUiFg30qNxA], reason [Failed to start shard, message [RecoveryFailedException[[derbysoft-20140730][1]: Recovery failed from [elasticsearch-ip-10-0-0-45][AjN-6_DHQK6B8NJgfphMvA][ip-10-0-0-45.us-west-2.compute.internal][inet[/10.0.0.45:9300]] into [elasticsearch-ip-10-0-0-41][W-7FsjjZTyOXZdaJhhqxEA][ip-10-0-0-41.us-west-2.compute.internal][inet[ip-10-0-0-41.us-west-2.compute.internal/10.0.0.41:9300]]]; nested: RemoteTransportException[[elasticsearch-ip-10-0-0-45][inet[/10.0.0.45:9300]][index/shard/recovery/startRecovery]]; nested: RecoveryEngineException[[derbysoft-20140730][1] Phase[2] Execution failed]; nested: ReceiveTimeoutTransportException[[elasticsearch-ip-10-0-0-41][inet[/10.0.0.41:9300]][index/shard/recovery/prepareTranslog] request_id [13988539] timed out after [900000ms]]; ]]

=====

I'm a bit at a loss as to what to try next to address this problem. Can
anyone offer a suggestion?
Thanks for reading this.

Chris


Why do you start with an 8GB heap? Can't you give it 16GB or so?

/usr/bin/java -Xms8g -Xmx8g

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 30 Jul 2014, at 19:47, Chris Neal chris.neal@derbysoft.net wrote:


Oops, sorry, that was a copy/paste error. It is using 16GB. Here are
the correct process arguments:

/usr/bin/java -Xms16g -Xmx16g -Xss256k -Djava.awt.headless=true -server
-XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20
-XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch
-Des.pidfile=/var/run/elasticsearch/elasticsearch.pid
-Des.path.home=/usr/share/elasticsearch [snip CP]

Thanks!
Chris

On Thu, Jul 31, 2014 at 2:43 AM, David Pilato david@pilato.fr wrote:


Sorry to bump my own thread, but it's been a while and I was hoping to get
some more eyes on this. I've since added a third node to the cluster to
see if that would help, but it did not. I still see these OOMEs on merges on any
of the three nodes in the cluster.

I have also increased the shard count to 3 to match the number of nodes in
the cluster.

The error happens on an index that is 44GB in size.

The process in top looks like this:

top - 15:39:59 up 63 days, 18:34, 2 users, load average: 0.77, 0.66, 0.71
Tasks: 343 total, 1 running, 342 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.1%us, 0.1%sy, 0.0%ni, 98.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 30688804k total, 27996932k used, 2691872k free, 62760k buffers
Swap: 10485752k total, 5832k used, 10479920k free, 9434584k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
29539 elastics 20 0 207g 17g 1.1g S 20.2 60.9 2874:16 java

The process using the 5MB of swap is not elasticsearch, just FYI.
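
If it helps with diagnosis, the per-node heap numbers can also be pulled
straight from the nodes stats API rather than inferred from top (sketch only;
host and port are placeholders):

# JVM heap and GC stats per node, as reported by Elasticsearch itself (ES 1.x).
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'
# Look at nodes.<node_id>.jvm.mem.heap_used_in_bytes vs. heap_max_in_bytes,
# and at jvm.gc.collectors for young/old collection counts and times.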

If there is any more information I can provide, please let me know. I'm
getting a bit desperate to get this one resolved!
Thank you so much for your time.
Chris

On Thu, Jul 31, 2014 at 10:06 AM, Chris Neal chris.neal@derbysoft.net
wrote:


I have a very similar cluster setup here (ES 1.3.2, 64GB RAM, 3 nodes, Java 8,
G1GC, ~100 shards, ~500GB of indexes on disk).

This is the culprit:

max_merged_segment: 15gb

I recommend:

max_merged_segment: 1gb

See also "Elasticsearch configuration for high sustainable bulk feed" on GitHub
(which also holds for ES 1.2 and ES 1.3; these versions have better out-of-the-box defaults for merges).

With this I can use an 8GB heap for my workload.

Rule of thumb: at any given time, your heap must be able to cope with an extra
allocation of max_merged_segment (this is NOT literally what happens behind the
scenes; it is just a rough estimate).

With 15gb in your setting, the risk of overallocating the heap is high once
your index gets large.
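
As a rough sketch of applying that change (illustrative only; the index name is
taken from your logs, and I have not verified this against your exact version):

# In elasticsearch.yml (applies to indices created after a restart):
#   index:
#     merge:
#       policy:
#         max_merged_segment: 1gb
#
# For an index that already exists, the setting can be tried as a live settings
# update; it only affects merges from that point on, and assumes your version
# accepts it as a dynamic index setting.
curl -XPUT 'http://localhost:9200/derbysoft-20140730/_settings' -d '
{
  "index.merge.policy.max_merged_segment": "1gb"
}'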

Jörg

On Wed, Sep 17, 2014 at 5:43 PM, Chris Neal chris.neal@derbysoft.net
wrote:

Sorry to bump my own thread, but It's been awhile and I was hoping to get
some more eyes on this. I've since added a third node to the cluster to
see if that helps, but it did not. I still see these OOME on merges on any
of the three nodes in the cluster.

I have also increased the shard count to 3 to match the number of nodes in
the cluster.

The error happens on an index that is 44GB in size.

The process in top looks like this:

top - 15:39:59 up 63 days, 18:34, 2 users, load average: 0.77, 0.66, 0.71
Tasks: 343 total, 1 running, 342 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.1%us, 0.1%sy, 0.0%ni, 98.8%id, 0.0%wa, 0.0%hi, 0.0%si,
0.0%st
Mem: 30688804k total, 27996932k used, 2691872k free, 62760k buffers
Swap: 10485752k total, 5832k used, 10479920k free, 9434584k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

29539 elastics 20 0 207g 17g 1.1g S 20.2 60.9 2874:16 java

The process using the 5MB of swap is not elasticsearch, just FYI.

If there is any more information I can provide, please let me know. I'm
getting a bit desperate to get this one resolved!
Thank you so much for your time.
Chris

On Thu, Jul 31, 2014 at 10:06 AM, Chris Neal chris.neal@derbysoft.net
wrote:

Ooops. Sorry. That was a copy/paste error. It is using 16GB. Here is
the correct process arguments:

/usr/bin/java -Xms16g -Xmx16g -Xss256k -Djava.awt.headless=true -server
-XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20
-XX:+DisableExplicitGC -XX:+HeapDumpOnOutOfMemoryError -Delasticsearch
-Des.pidfile=/var/run/elasticsearch/elasticsearch.pid
-Des.path.home=/usr/share/elasticsearch [snip CP]

Thanks!
Chris

On Thu, Jul 31, 2014 at 2:43 AM, David Pilato david@pilato.fr wrote:

Why do you start with 8gb HEAP? Can't you give 16gb or so?

/usr/bin/java -Xms8g -Xmx8g

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs


Thank you so very much for the reply!

That makes sense. I will look at the gist as well, and make some changes
to test.

Again, thank you for your time. I will report back with some results!
Chris

On Wed, Sep 17, 2014 at 11:36 AM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

I have a very similar cluster setup here (ES 1.3.2, 64G RAM, 3 nodes, Java
8, G1GC, ~100 shards, ~500g of indexes on disk).

This is the culprit:

max_merged_segment: 15gb

I recommend:

max_merged_segment: 1gb

See also the "Elasticsearch configuration for high sustainable bulk feed"
gist on GitHub (which also holds for ES 1.2 and ES 1.3 - these versions have
better OOTB defaults for merge).

With this I can use an 8g heap for my workload.

Rule of thumb: at any time, your heap must be able to cope with an extra
allocation of max_merged_segment (this is NOT what happens behind the
scenes, it is just a rough estimate).

With 15g in your setting, the risk of over-allocating the heap is high when
your index gets large.

Jörg
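(A minimal sketch of applying that, assuming the setting stays in
elasticsearch.yml as in the config posted above: change max_merged_segment
from 15gb to 1gb in the index.merge.policy block and restart the nodes. If
the merge policy settings are dynamically updatable in this ES version - an
assumption worth checking against the 1.x docs - the same change can also be
pushed to all existing indices without a restart:

curl -XPUT 'http://localhost:9200/_settings' -d '{ "index.merge.policy.max_merged_segment" : "1gb" }'

Either way, the limit only applies to segments produced by future merges.)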


A followup as promised.
5 days of up-time, no OOME yet.
So far so good!
One more update after a few more days....

Again, thank you Jörg! :slight_smile:


I am curious. Is the good news still good?

I'm afraid the use of compressed oops (-XX:+UseCompressedOops) may also be causing the OOME. Can you try running your cluster without it?
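(A minimal sketch of checking that, assuming a 64-bit HotSpot JVM: the
effective value can be inspected with

/usr/bin/java -XX:+PrintFlagsFinal -version | grep UseCompressedOops

and it can be forced off by replacing -XX:+UseCompressedOops with
-XX:-UseCompressedOops in the startup arguments. Note that 64-bit HotSpot
enables compressed oops by default for heaps under roughly 32GB, so simply
removing the + flag would not actually turn them off.)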

It is, actually. I have had no more OOMEs since this last posting. Still running great for me. I'm up to 1.6.0 in Production currently, and 1.7.1 in Dev. Soon 1.7.1 will go to prod as well, since I have had no issues with it.

Hope that helps!


Would you care to elaborate on this? What about UseCompressedOops would be causing the OOME? I have had no issues since my last posting, but perhaps I have just been lucky. I am not aware of any potential issues with CompressedOops!

Thanks,
Chris

Please try to run without -XX:+DisableExplicitGC

I'm confused. I'm not having any OOME problems, and now it has been suggested both to run without UseCompressedOops and without DisableExplicitGC, but no reasons have been given as to why I should try these things.

If things are working fine, and have been for at least 75 days of uptime on my cluster, I'm hard pressed to go change these parameters, especially without an explanation. :wink:

Please elaborate for me?
Thanks!
Chris
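(For reference, a minimal sketch of the DisableExplicitGC suggestion: drop
that flag from the startup line, leaving something like

/usr/bin/java -Xms16g -Xmx16g -Xss256k -Djava.awt.headless=true -server -XX:+UseCompressedOops -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:+HeapDumpOnOutOfMemoryError [snip other args]

The usual motivation for avoiding -XX:+DisableExplicitGC is that it makes
System.gc() a no-op, and the JVM's NIO code relies on explicit GC requests
to reclaim direct (off-heap) buffers under pressure; whether that is the
reasoning behind the suggestion in this thread is not stated.)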