My cluster periodically gets into a weird state where one of the nodes
(usually the master) gets disconnected from the cluster, partitioning
it. The remaining nodes elect a new master and sometimes recover without
a hiccup, but the original master still believes it is the master and is
still connected to all of the remaining nodes.
I have a script that runs es health against every node once a minute (a
rough Python equivalent appears after the trace), and the resulting
trace of statuses is interesting:
<only five nodes respond at 00:03:02, but they all report green with 8 nodes>
2013-09-04 00:03:02 ESClusterName green 8 8 695 1390 0 0 0 node20
2013-09-04 00:03:02 ESClusterName green 8 8 695 1390 0 0 0 node17
2013-09-04 00:03:02 ESClusterName green 8 8 695 1390 0 0 0 node19
2013-09-04 00:03:02 ESClusterName green 8 8 695 1390 0 0 0 node12
2013-09-04 00:03:02 ESClusterName green 8 8 695 1390 0 0 0 node14
<something happens, and two nodes are reporting 7 nodes in the cluster>
2013-09-04 00:04:02 ESClusterName green 8 8 695 1390 0 0 0 node16
2013-09-04 00:04:02 ESClusterName green 8 8 695 1390 0 0 0 node20
2013-09-04 00:04:02 ESClusterName green 8 8 695 1390 0 0 0 node13
2013-09-04 00:04:02 ESClusterName green 8 8 695 1390 0 0 0 node21
2013-09-04 00:04:02 ESClusterName green 8 8 695 1390 0 0 0 node12
2013-09-04 00:04:08 ESClusterName green 8 8 695 1390 0 0 0 node14
2013-09-04 00:04:08 ESClusterName green 7 7 695 1390 0 0 0 node19
2013-09-04 00:04:10 ESClusterName yellow 7 7 695 1220 0 14 156 node17
<now we are in the partitioned state, and node20 is off in neverland>
2013-09-04 00:05:03 ESClusterName green 8 8 695 1390 0 0 0 node20
2013-09-04 00:05:03 ESClusterName yellow 7 7 695 1285 0 14 91 node16
2013-09-04 00:05:03 ESClusterName yellow 7 7 695 1285 0 14 91 node19
2013-09-04 00:05:03 ESClusterName yellow 7 7 695 1285 0 14 91 node13
2013-09-04 00:05:03 ESClusterName yellow 7 7 695 1285 0 14 91 node17
2013-09-04 00:05:03 ESClusterName yellow 7 7 695 1285 0 14 91 node14
2013-09-04 00:05:04 ESClusterName yellow 7 7 695 1286 0 14 90 node21
2013-09-04 00:05:04 ESClusterName yellow 7 7 695 1286 0 14 90 node12
<and now we're stuck in the partitioned state, ad infinitum: the seven remaining nodes recover to green on their own, but node20 still reports 8 nodes>
2013-09-04 00:10:02 ESClusterName green 8 8 695 1390 0 0 0 node20
2013-09-04 00:10:03 ESClusterName green 7 7 695 1390 0 0 0 node16
2013-09-04 00:10:03 ESClusterName green 7 7 695 1390 0 0 0 node13
2013-09-04 00:10:03 ESClusterName green 7 7 695 1390 0 0 0 node17
2013-09-04 00:10:03 ESClusterName green 7 7 695 1390 0 0 0 node21
2013-09-04 00:10:03 ESClusterName green 7 7 695 1390 0 0 0 node14
2013-09-04 00:10:03 ESClusterName green 7 7 695 1390 0 0 0 node19
2013-09-04 00:10:05 ESClusterName green 7 7 695 1390 0 0 0 node12
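
For reference, the check is roughly equivalent to the Python sketch
below. It is a minimal stand-in for the actual script (which shells out
to es health); the host names and port 9200 are assumptions.

# poll_health.py - print each node's view of cluster health (run from cron)
import json
import time
import urllib.request

NODES = ["node12", "node13", "node14", "node16",
         "node17", "node19", "node20", "node21"]  # hypothetical host names

for node in NODES:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        url = "http://%s:9200/_cluster/health" % node  # assumed HTTP port
        with urllib.request.urlopen(url, timeout=5) as resp:
            h = json.load(resp)
        # Same columns as the trace: cluster, status, total nodes, data
        # nodes, primary shards, active shards, relocating, initializing,
        # unassigned, and which node answered.
        print(stamp, h["cluster_name"], h["status"],
              h["number_of_nodes"], h["number_of_data_nodes"],
              h["active_primary_shards"], h["active_shards"],
              h["relocating_shards"], h["initializing_shards"],
              h["unassigned_shards"], node)
    except Exception as exc:  # node down or unreachable
        print(stamp, "ERROR", node, exc)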
The log file on the missing master contains many instances of errors
like the one below, but the really odd thing is that the IP address
mentioned in the message belongs to another node in the cluster (the
master itself was 172.31.12.20):
[2013-09-04 00:01:07,781][WARN ][discovery.zen.ping.multicast] [node20.example.com] failed to connect to requesting node [localhost][SprCZQI6TDuyDEhikklrUg][inet[/172.31.12.12:9301]]{client=true, data=false}
org.elasticsearch.transport.ConnectTransportException: [localhost][inet[/172.31.12.12:9301]] connect_timeout[30s]
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:673)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:608)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:578)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:128)
    at org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing$Receiver$2.run(MulticastZenPing.java:539)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:679)
The other nodes all have something similar to this:
[2013-09-04 00:04:08,164][INFO ][discovery.zen            ] [node12.example.com] master_left [[node20.example.com][hhmHf3hrSz6tjSktUN4Krg][inet[/172.31.12.20:9300]]], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2013-09-04 00:04:08,208][INFO ][cluster.service          ] [node12.example.com] master {new [node17.example.com][6lA0f-nmR8etT6GehitTmA][inet[/172.31.12.17:9300]], previous [node20.example.com][hhmHf3hrSz6tjSktUN4Krg][inet[/172.31.12.20:9300]]}, removed {[node20.example.com][hhmHf3hrSz6tjSktUN4Krg][inet[/172.31.12.20:9300]],}, reason: zen-disco-receive(from master [[node17.example.com][6lA0f-nmR8etT6GehitTmA][inet[/172.31.12.17:9300]]])
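
To confirm the split while it is happening, something like the sketch
below can ask every node which node it currently believes is the master
(again, the host names are assumptions; it fetches the unfiltered
/_cluster/state because that works across versions, whereas the
/_cluster/state/master_node,nodes shorthand only arrived later):

# who_is_master.py - print each node's current notion of the master
import json
import urllib.request

NODES = ["node12", "node13", "node14", "node16",
         "node17", "node19", "node20", "node21"]  # hypothetical host names

for node in NODES:
    url = "http://%s:9200/_cluster/state" % node  # assumed HTTP port
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            state = json.load(resp)
        master_id = state.get("master_node")
        master = state["nodes"][master_id]["name"] if master_id else "<none>"
        print("%s thinks the master is %s" % (node, master))
    except Exception as exc:  # unreachable or mid-election
        print("%s: error (%s)" % (node, exc))

In the partitioned state above, I would expect node20 to name itself
while the other seven nodes name node17.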