We have a problem that seems to have brought down the cluster, and the cluster has failed to restart. At some point during the night a problem occurred on one of the 3 machines in the cluster (which we think was the master node). Unfortunately, it logged so much that the log files have rolled, so we can't see exactly what happened at the time. It seems that when things went wrong with this node, the other 2 nodes in the cluster also started having problems: we see "master_left ... reason [failed to ping, tried [3] times, each with maximum [30s] timeout]" in the log files of both at roughly the same time of day (detailed stack traces below).
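For reference, we have not overridden the zen fault-detection settings, so we assume the [3] retries and [30s] timeout in that message come from the defaults, i.e. something like this in elasticsearch.yml (shown for illustration, not copied from our config):

discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3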
The node that went wrong, which we believe was the master, has a very busy log file full of messages. Its JVM continued to run, unlike the other two nodes, which the java-service-wrapper tried, but failed, to restart. Here are some examples from the master node's log file that are repeated a couple of times every second:
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [oztrading-inputs][4] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)

sending failed shard for [oztrading-inputs][4], node[hfjEe7hQTUqjJc3JVb_hvw], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[oztrading-inputs][4] shard allocated for local recovery (post api), should exists, but doesn't]]]
The two other nodes, which appeared to detect the master failure, were restarted by the java-service-wrapper shortly after, but neither has been able to come back up successfully; they both report MasterNotDiscoveredException. Here are their respective log entries at the point it all went wrong:
[Luichow, Chan] master_left [[Blindspot][BxtjbSbhThOvMdgg9M73yg][inet[/169.52.84.57:9310]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[Luichow, Chan] master {new [Luichow, Chan][bfPgqLF2RuuLfuATmdEojA][inet[/169.52.84.159:9310]], previous [Blindspot][BxtjbSbhThOvMdgg9M73yg][inet[/169.52.84.57:9310]]}, removed {[Blindspot][BxtjbSbhThOvMdgg9M73yg][inet[/169.52.84.57:9310]],}, reason: zen-disco-master_failed ([Blindspot][BxtjbSbhThOvMdgg9M73yg][inet[/169.52.84.57:9310]])
Exception in thread "elasticsearch[Luichow, Chan]clusterService#updateTask-pool-21-thread-1" java.lang.NullPointerException
    at org.elasticsearch.cluster.routing.RoutingNode.prettyPrint(RoutingNode.java:142)
    at org.elasticsearch.cluster.routing.RoutingNodes.prettyPrint(RoutingNodes.java:241)
    at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:210)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
[Luichow, Chan] added {[Entropic Man][hfjEe7hQTUqjJc3JVb_hvw][inet[/169.52.84.154:9310]],}, reason: zen-disco-receive(join from node[[Entropic Man][hfjEe7hQTUqjJc3JVb_hvw][inet[/169.52.84.154:9310]]])
JVM appears hung: Timed out waiting for signal from JVM.
JVM did not exit on request, terminated
[Blindspot] master_left [[Stein, Chase][KAXs-1EjQJmH9btY81b93w][inet[/169.52.84.154:9310]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[Blindspot] master {new [Blindspot][BxtjbSbhThOvMdgg9M73yg][inet[/169.52.84.57:9310]], previous [Stein, Chase][KAXs-1EjQJmH9btY81b93w][inet[/169.52.84.154:9310]]}, removed {[Stein, Chase][KAXs-1EjQJmH9btY81b93w][inet[/169.52.84.154:9310]],}, reason: zen-disco-master_failed ([Stein, Chase][KAXs-1EjQJmH9btY81b93w][inet[/169.52.84.154:9310]])
Exception in thread "elasticsearch[Blindspot]clusterService#updateTask-pool-21-thread-1" java.lang.NullPointerException
    at org.elasticsearch.cluster.routing.RoutingNode.prettyPrint(RoutingNode.java:142)
    at org.elasticsearch.cluster.routing.RoutingNodes.prettyPrint(RoutingNodes.java:241)
    at org.elasticsearch.cluster.service.InternalClusterService$2.run(InternalClusterService.java:210)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
JVM appears hung: Timed out waiting for signal from JVM.
JVM did not exit on request, terminated
And here are the MasterNotDiscoveredExceptions:
[Kraven the Hunter] {0.18.7}[10064]: initializing ...
[Kraven the Hunter] loaded [], sites []
[Kraven the Hunter] {0.18.7}[10064]: initialized
[Kraven the Hunter] {0.18.7}[10064]: starting ...
[Kraven the Hunter] bound_address {inet[/0.0.0.0:9310]}, publish_address {inet[/169.52.84.159:9310]}
[Kraven the Hunter] waited for 30s and no initial state was set by the discovery
[Kraven the Hunter] index-uat-apac-cluster/JJzD1JtnRTezPBiTaEAAHw
[Kraven the Hunter] bound_address {inet[/0.0.0.0:9210]}, publish_address {inet[/169.52.84.159:9210]}
[Kraven the Hunter] {0.18.7}[10064]: started
[giraffe.audit.ExceptionAuditEvent] Wrapper failed to start
org.elasticsearch.discovery.MasterNotDiscoveredException:
    at org.elasticsearch.action.support.master.TransportMasterNodeOperationAction$3.onTimeout(TransportMasterNodeOperationAction.java:162)
    at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:332)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
[Gravity] {0.18.7}[4396]: initializing ...
[Gravity] loaded [], sites []
[Gravity] {0.18.7}[4396]: initialized
[Gravity] {0.18.7}[4396]: starting ...
[Gravity] bound_address {inet[/0.0.0.0:9310]}, publish_address {inet[/169.52.84.57:9310]}
[Gravity] waited for 30s and no initial state was set by the discovery
[Gravity] index-uat-apac-cluster/zcNilzlZQtCa8wtfp8HACw
[Gravity] bound_address {inet[/0.0.0.0:9210]}, publish_address {inet[/169.52.84.57:9210]}
[Gravity] {0.18.7}[4396]: started
Wrapper failed to start
org.elasticsearch.discovery.MasterNotDiscoveredException:
    at org.elasticsearch.action.support.master.TransportMasterNodeOperationAction$3.onTimeout(TransportMasterNodeOperationAction.java:162)
    at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:332)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
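As an aside, we assume the "waited for 30s and no initial state was set by the discovery" lines above correspond to the default initial-state timeout, which (if we have understood it correctly) would be the equivalent of the following in elasticsearch.yml; we have not changed it:

discovery.initial_state_timeout: 30s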
We're running on:
Windows Server 2003 (64-bit)
24 GB RAM
4000 MB heap
ES v0.18.7
JDK 1.7.0_02 (64-bit)
We doubt this is important, but here is our configuration (a rough elasticsearch.yml sketch follows the list):
- discovery.zen.ping.multicast.enabled is disabled; we specify the hosts using discovery.zen.ping.unicast.hosts
- elastic.index.cache.field.max_size=1000
- elastic.index.cache.field.expire=5m
- the number of replicas is 1 for all our indexes, and the number of shards is 5
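In elasticsearch.yml terms, we believe those settings amount to roughly the following (the unicast host names are placeholders, and we have used the standard setting keys rather than the elastic.-prefixed names above):

discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["node1.example.com:9310", "node2.example.com:9310", "node3.example.com:9310"]
index.cache.field.max_size: 1000
index.cache.field.expire: 5m
index.number_of_shards: 5
index.number_of_replicas: 1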
At the point it went wrong, all of the JVMs had plenty of free heap space, and the machines had plenty of free system memory and free disk space. We have no visibility into how the network was performing at the time.
It might also be worth noting that when a node is deemed to have left the cluster, the java-service-wrapper often kills the other nodes and reports "JVM appears hung: Timed out waiting for signal from JVM". Could the re-balancing process be causing this?
Any advice is appreciated.