OOM on Cold Cluster Start


(David Kleiner) #1

Hi,

With the latest stable ES, I'm getting OOM errors on cold cluster start
whenever the heap size is under 25-30G. I had to find a really beefy box to
get the cluster up and running, and then join two more ES nodes with 10G
heaps to it.

Once the cluster is operational, heap usage stays under 10G. I have a
2-way cluster with a single data-less gateway, 40 indices (mostly user
logs fed by logstash, split by month), 392 total shards (across both nodes),
and about 220G of total data, 110G per node. I kept the default of 5 shards
per index.

Recovery on cold start was really painful and meant hours of downtime until
I found a big temporary node.

Any recommendations for avoiding this situation on the next cold start
would be appreciated!

Thank you,

David

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(David Kleiner) #2

Here is the stack trace from the cold start (single node, not joined to the cluster):

[2013-11-14 15:12:47,626][WARN ][index.engine.robin ] [Typeface] [eventlog-2013.11][0] failed to prepare/warm
java.lang.OutOfMemoryError: Java heap space
    at org.elasticsearch.search.SearchService$IndexReaderWarmer.warm(SearchService.java:649)
    at org.elasticsearch.indices.warmer.InternalIndicesWarmer.warm(InternalIndicesWarmer.java:90)
    at org.elasticsearch.index.engine.robin.RobinEngine$RobinSearchFactory.newSearcher(RobinEngine.java:1622)
    at org.apache.lucene.search.SearcherManager.getSearcher(SearcherManager.java:155)
    at org.apache.lucene.search.SearcherManager.<init>(SearcherManager.java:89)
    at org.elasticsearch.index.engine.robin.RobinEngine.buildSearchManager(RobinEngine.java:1505)
    at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:280)
    at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryPrepareForTranslog(InternalIndexShard.java:660)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:201)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:174)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)
[2013-11-14 15:12:50,555][WARN ][cluster.action.shard ] [Typeface] [nginx-2013.11][1] sending failed shard for [nginx-2013.11][1], node[nc9mQX0vQzyWWINV0l8t9Q], [P], s[INITIALIZING], indexUUID [na], reason [Failed to create shard, message [IndexShardCreationException[[nginx-2013.11][1] failed to create shard]; nested: ExecutionError[java.lang.OutOfMemoryError: Java heap space]; nested: OutOfMemoryError[Java heap space]; ]]
[2013-11-14 15:12:50,556][WARN ][cluster.action.shard ] [Typeface] [nginx-2013.11][1] received shard failed for [nginx-2013.11][1], node[nc9mQX0vQzyWWINV0l8t9Q], [P], s[INITIALIZING], indexUUID [na], reason [Failed to create shard, message [IndexShardCreationException[[nginx-2013.11][1] failed to create shard]; nested: ExecutionError[java.lang.OutOfMemoryError: Java heap space]; nested: OutOfMemoryError[Java heap space]; ]]


(Alexander Reelsen) #3

Hey,

are you using Elasticsearch 0.90.6 or 0.90.7?
Elasticsearch 0.90.7 was released yesterday to fix an issue that might
cause OOMs in that particular setup; see the release blog post at
http://www.elasticsearch.org/blog/0-90-7-released/

--Alex
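
Since the answer hinges on which release each node is running, here is a rough sketch of how one might check that before the next cold start. It relies on the node's root endpoint (for example GET http://localhost:9200/, assuming the default host and port), which returns a JSON document containing version.number. The function names and the sample response are made up for illustration, and the 0.90.7 threshold is simply the release mentioned above.

```python
import json


def parse_version(number):
    """Turn '0.90.7' into (0, 90, 7); stops at suffixes such as 'Beta1' or '-SNAPSHOT'."""
    parts = []
    for piece in number.split("-")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def node_is_patched(root_response, minimum="0.90.7"):
    """Return True if the node's reported version is at least `minimum`.

    `root_response` is the JSON text returned by the node's root endpoint.
    """
    info = json.loads(root_response)
    return parse_version(info["version"]["number"]) >= parse_version(minimum)


# Hypothetical, trimmed-down response body from a node still on 0.90.6.
sample = '{"name": "Typeface", "version": {"number": "0.90.6"}}'
print(node_is_patched(sample))
```

Running the same check against every node before a full restart would show whether any of them is still on the older release.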


(David Kleiner) #4

That was 0.90.6, I'll give 0.90.7 a try, thank you Alex!


(David Kleiner) #5

Happy to report that with 0.90.7 the nodes came back up fast and with no heap
issues. I guess I can bring the heap size down from 12G to 8G to give more
room to the logstash instances.

Cheers!

David

