Hi all,
I'm hoping someone can help me piece together the log entries, stack
traces, and exceptions below. I have a 3-node development cluster in EC2, and two
of the nodes had issues. I'm running ES 1.4.4 on servers dedicated to ES, each
with 32GB RAM and a 16GB heap. My index rate averages about 10k/sec. There were no
searches going on at the time of the incident.
It appears to me that node 10.0.0.12 began timing out requests to 10.0.0.45,
indicating that 10.0.0.45 was having issues.
Then at 4:36, 10.0.0.12 logs the ERROR about "Uncaught exception:
IndexWriter already closed", caused by an OOME.
Then at 4:43, 10.0.0.45 hits the "Create failed" WARN, and logs an OOME.
Then things are basically down and unresponsive.
What is weird to me is that if 10.0.0.45 was the node having issues, why
did 10.0.0.12 log an exception 7 minutes before that? Did both nodes run
out of memory? Or is one of the exceptions actually saying, "I see that
this other node hit an OOME, and I'm telling you about it"?
I have tweaked a few values in the elasticsearch.yml file to try to keep
this from happening (configured from Puppet; a sketch of how I expect these to
render in elasticsearch.yml follows the list):
'indices.breaker.fielddata.limit' => '20%',
'indices.breaker.total.limit' => '25%',
'indices.breaker.request.limit' => '10%',
'index.merge.scheduler.type' => 'concurrent',
'index.merge.scheduler.max_thread_count' => '1',
'index.merge.policy.type' => 'tiered',
'index.merge.policy.max_merged_segment' => '1gb',
'index.merge.policy.segments_per_tier' => '4',
'index.merge.policy.max_merge_at_once' => '4',
'index.merge.policy.max_merge_at_once_explicit' => '4',
'indices.memory.index_buffer_size' => '10%',
'indices.store.throttle.type' => 'none',
'index.translog.flush_threshold_size' => '1GB',
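For what it's worth, assuming the Puppet template just writes those keys out
flat (I haven't pasted my actual generated file), the corresponding
elasticsearch.yml lines would look something like this:

    indices.breaker.fielddata.limit: 20%
    indices.breaker.total.limit: 25%
    indices.breaker.request.limit: 10%
    index.merge.scheduler.type: concurrent
    index.merge.scheduler.max_thread_count: 1
    index.merge.policy.type: tiered
    index.merge.policy.max_merged_segment: 1gb
    index.merge.policy.segments_per_tier: 4
    index.merge.policy.max_merge_at_once: 4
    index.merge.policy.max_merge_at_once_explicit: 4
    indices.memory.index_buffer_size: 10%
    indices.store.throttle.type: none
    index.translog.flush_threshold_size: 1GB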
I have done a fair bit of reading on this, and have tried just about
everything I can think of.
Can anyone tell me what caused this scenario, and what can be done to avoid
it?
Thank you so much for taking the time to read this.
Chris
=====
On server 10.0.0.12:
[2015-03-04 03:56:12,548][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [20456ms] ago, timed out [5392ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70061596]
[2015-03-04 04:06:02,407][INFO ][index.engine.internal ] [elasticsearch-ip-10-0-0-12] [derbysoft-ihg-20150304][2] now throttling indexing: numMergesInFlight=4, maxNumMerges=3
[2015-03-04 04:06:04,141][INFO ][index.engine.internal ] [elasticsearch-ip-10-0-0-12] [derbysoft-ihg-20150304][2] stop throttling indexing: numMergesInFlight=2, maxNumMerges=3
[2015-03-04 04:12:26,194][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [15709ms] ago, timed out [708ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70098828]
[2015-03-04 04:23:40,778][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [21030ms] ago, timed out [6030ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70124234]
[2015-03-04 04:24:47,023][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [27275ms] ago, timed out [12275ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70126273]
[2015-03-04 04:25:39,180][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [19431ms] ago, timed out [4431ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70127835]
[2015-03-04 04:26:40,775][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [19241ms] ago, timed out [4241ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70129981]
[2015-03-04 04:27:14,329][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [22676ms] ago, timed out [6688ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70130668]
[2015-03-04 04:28:15,695][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [24042ms] ago, timed out [9041ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70132644]
[2015-03-04 04:29:38,102][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [16448ms] ago, timed out [1448ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70135333]
[2015-03-04 04:33:42,393][WARN ][transport ] [elasticsearch-ip-10-0-0-12] Received response for a request that has timed out, sent [20738ms] ago, timed out [5737ms] ago, action [cluster:monitor/nodes/stats[n]], node [[elasticsearch-ip-10-0-0-45][i4gmsxs0Q0eyvPWjajNV5A][ip-10-0-0-45.us-west-2.compute.internal][inet[ip-10-0-0-45.us-west-2.compute.internal/10.0.0.45:9300]]{master=true}], id [70142427]
[2015-03-04 04:36:08,788][ERROR][marvel.agent ] [elasticsearch-ip-10-0-0-12] Background thread had an uncaught exception:
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:698)
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:712)
        at org.apache.lucene.index.IndexWriter.ramBytesUsed(IndexWriter.java:462)
        at org.elasticsearch.index.engine.internal.InternalEngine.segmentsStats(InternalEngine.java:1224)
        at org.elasticsearch.index.shard.service.InternalIndexShard.segmentStats(InternalIndexShard.java:555)
        at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:170)
        at org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
        at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
        at org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:172)
        at org.elasticsearch.node.service.NodeService.stats(NodeService.java:138)
        at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.exportNodeStats(AgentService.java:300)
        at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:225)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
=====
On server 10.0.0.45:
[2015-03-04 04:43:27,245][WARN ][index.engine.internal ] [elasticsearch-ip-10-0-0-45] [myindex-20150304][1] failed engine [indices:data/write/bulk[s] failed on replica]
org.elasticsearch.index.engine.CreateFailedEngineException: [myindex-20150304][1] Create failed for [my_type#AUvjGHoiku-fZf277h_4]
        at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:421)
        at org.elasticsearch.index.shard.service.InternalIndexShard.create(InternalIndexShard.java:403)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:595)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:246)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$ReplicaOperationTransportHandler.messageReceived(TransportShardReplicationOperationAction.java:225)
        at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:698)
        at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:712)
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1507)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1246)
        at org.elasticsearch.index.engine.internal.InternalEngine.innerCreateNoLock(InternalEngine.java:502)
        at org.elasticsearch.index.engine.internal.InternalEngine.innerCreate(InternalEngine.java:444)
        at org.elasticsearch.index.engine.internal.InternalEngine.create(InternalEngine.java:413)
        ... 8 more
Caused by: java.lang.OutOfMemoryError: Java heap space
=====