Hi all,
We recently had a cascading cluster failure. From 16:35 to 16:42 the
cluster went red and then recovered itself. I can't seem to find anything
obvious in the logs around this time.
The cluster has 19 nodes: 9 physical boxes each running two instances of
Elasticsearch, plus one VM acting as a load balancer for indexing. CPU
usage is normal and memory usage is below 75%.
(screenshot: heap during the outage)
(screenshot: heap once stable)
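In case it matters, the two instances per box run with
max_local_storage_nodes set to 2 (you can see it in the node attributes
in the logs below). Roughly, the per-instance config looks like the
sketch below; apart from max_local_storage_nodes, the rack attribute,
and master=true, which all appear in the log lines further down, the
values are illustrative rather than copied from our files:

node.max_local_storage_nodes: 2   # allow two node instances to share the data path on one box
node.rack: E0007                  # per-box rack attribute (shows up as rack=E0007 in the logs)
node.master: true                 # the log attributes show master=true on the data instances
node.data: true                   # illustrative; not taken from the logs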
Below is the list of events that happened, according to Marvel:
2014-12-23T16:41:22.456-07:00  node_event  node_joined        [E0009-1][XX] joined
2014-12-23T16:41:19.439-07:00  node_event  node_left          [E0009-1][XX] left
2014-12-23T16:41:19.439-07:00  node_event  elected_as_master  [E0011-0][XX] became master
2014-12-23T16:41:04.392-07:00  node_event  node_joined        [E0007-0][XX] joined
2014-12-23T16:40:49.176-07:00  node_event  node_joined        [E0007-1][XX] joined
2014-12-23T16:40:07.781-07:00  node_event  node_left          [E0007-1][XX] left
2014-12-23T16:40:07.781-07:00  node_event  elected_as_master  [E0010-0][XX] became master
2014-12-23T16:39:51.802-07:00  node_event  node_left          [E0011-1][XX] left
2014-12-23T16:39:05.897-07:00  node_event  node_left          [-E0004-0][XX] left
2014-12-23T16:38:39.128-07:00  node_event  node_left          [E0007-1][XX] left
2014-12-23T16:38:39.128-07:00  node_event  elected_as_master  [XX] became master
2014-12-23T16:38:22.445-07:00  node_event  node_left          [E0007-1][XX] left
2014-12-23T16:38:19.298-07:00  node_event  node_left          [E0007-0][XX] left
2014-12-23T16:32:57.804-07:00  node_event  elected_as_master  [XX] became master
2014-12-23T16:32:57.804-07:00  node_event  node_left          [E0012-0][XX] left
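For reference, the same events should be retrievable from the Marvel
data itself with a range query. A sketch only: it assumes Marvel's
default .marvel-YYYY.MM.DD index naming, the node_event document type
shown above, and an @timestamp field; host and port are placeholders.

curl -s 'localhost:9200/.marvel-2014.12.23/node_event/_search?pretty' -d '{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "2014-12-23T16:30:00-07:00",
        "lte": "2014-12-23T16:45:00-07:00"
      }
    }
  },
  "sort": [ { "@timestamp": { "order": "desc" } } ],
  "size": 50
}'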
All I can find are some INFO logs from when the new master is elected,
with reason "zen-disco-master_failed":
[2014-12-23 17:32:27,668][INFO ][cluster.service] [E0007-1] master {
  new [E0007-1][M8pl6CaVTWi73pWLuOFPfQ][E0007][inet[E0007/xxx]]{rack=E0007, max_local_storage_nodes=2, master=true},
  previous [E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true}
}, removed {
  [E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true},
}, reason: zen-disco-master_failed ([E0012-0][JpBCQSK_QKWj84OTzBaOXg][E0012][inet[/xxx]]{rack=E0012, max_local_storage_nodes=2, master=true})
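As far as I understand, zen-disco-master_failed means the fault-detection
pings from a node to the elected master failed repeatedly, so the node
dropped the master and triggered a new election. The settings that control
this (stock 1.x defaults as far as I know; we have not tuned them), plus
the quorum setting that is supposed to keep elections sane, look like this:

# elasticsearch.yml
discovery.zen.fd.ping_interval: 1s      # how often a node pings the master (and vice versa)
discovery.zen.fd.ping_timeout: 30s      # how long to wait for each ping reply
discovery.zen.fd.ping_retries: 3        # consecutive failures before the other side is declared dead
discovery.zen.minimum_master_nodes: 10  # quorum: with 18 master-eligible instances, 18/2 + 1 = 10

So with defaults, a master that stops responding for roughly three
timed-out pings in a row (e.g. during a long GC pause) would be dropped,
which seems consistent with the cascade in the Marvel events above.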
I couldn't find any other errors or warnings around this time. All I can
find are OOM errors, which were also happening before the outage.
I found similar logs on all the nodes just before each node left:
[2014-12-23 17:38:20,117][WARN ][index.translog] [E0007-1] [xxxx70246][10] failed to flush shard on translog threshold
org.elasticsearch.index.engine.FlushFailedEngineException: [xxxx10170246][10] Flush failed
        at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:868)
        at org.elasticsearch.index.shard.service.InternalIndexShard.flush(InternalIndexShard.java:609)
        at org.elasticsearch.index.translog.TranslogService$TranslogBasedFlush$1.run(TranslogService.java:201)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: this writer hit an OutOfMemoryError; cannot commit
        at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2941)
        at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3122)
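Given that the underlying cause in this trace is the Lucene IndexWriter
hitting an OutOfMemoryError (after which the writer refuses to commit),
heap seems to be the thing to watch. A quick way to compare heap across
all nodes (standard 1.x _cat API; host and port are placeholders):

curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'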
Also, I found some transport exceptions, which are not new:
[2014-12-23 17:37:52,328][WARN ][search.action] [E0007-1] Failed to send release search context
org.elasticsearch.transport.SendRequestTransportException: [E0012-0][inet[ALLEG-P-E0012/172.16.116.112:9300]][search/freeContext]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:220)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:190)
        at org.elasticsearch.search.action.SearchServiceTransportAction.sendFreeContext(SearchServiceTransportAction.java:125)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(TransportSearchTypeAction.java:348)
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:147)
        at org.elasticsearch.action.search.type.TransportSea
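If this happens again I will try to capture what the affected node is
actually doing while it is unresponsive, with something like the
following (standard 1.x hot threads endpoint; host and node name are
placeholders):

curl -s 'localhost:9200/_nodes/E0007-1/hot_threads?threads=5&interval=500ms'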
The cluster recovered after about 7 minutes and is back up and green. Can
these errors cause a node to stop responding, making the cluster think the
node is dead, elect a new master, and so on? If not, I would appreciate
some pointers on where to look, or on what might have happened.
Thanks,
Abhishek