We upgraded two clusters from 0.17.8 to 0.18.7 that were using unicast
discovery (due to network restrictions) and ended up in a split-brain
situation: nodes would only discover other nodes running the same
version as themselves.
The node that was upgraded to 0.18.7 first started spewing these
errors in its log when it came back up:
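For reference, unicast discovery of the sort we're running is configured roughly like this (hostnames below are placeholders, not our actual config):

```yaml
# elasticsearch.yml -- disable multicast, list seed hosts explicitly
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["search-01.example.com", "search-02.example.com"]
```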
[2012-01-11 10:03:53,674][INFO ][node ]
[search-01] {elasticsearch/0.17.8}[19888]: stopping ...
[2012-01-11 10:03:53,929][INFO ][node ]
[search-01] {elasticsearch/0.17.8}[19888]: stopped
[2012-01-11 10:03:53,929][INFO ][node ]
[search-01] {elasticsearch/0.17.8}[19888]: closing ...
[2012-01-11 10:03:54,228][INFO ][node ]
[search-01] {elasticsearch/0.17.8}[19888]: closed
[2012-01-11 10:05:49,512][INFO ][node ]
[search-01] {0.18.7}[7135]: initializing ...
[2012-01-11 10:05:49,519][INFO ][plugins ]
[search-01] loaded [], sites [bigdesk, head]
[2012-01-11 10:05:50,776][INFO ][node ]
[search-01] {0.18.7}[7135]: initialized
[2012-01-11 10:05:50,776][INFO ][node ]
[search-01] {0.18.7}[7135]: starting ...
[2012-01-11 10:05:50,836][INFO ][transport ]
[search-01] bound_address {inet[/0:0:0:0:0:0:0:0:9300]},
publish_address {inet[/10.147.174.140:9300]}
[2012-01-11 10:05:50,970][WARN ][discovery.zen.ping.unicast]
[search-01] failed to send ping to
[[#zen_unicast_4#][inet[search-04.example.com/10.147.176.146:9300]]]
org.elasticsearch.transport.RemoteTransportException:
[search-04][inet[/10.147.176.146:9300]][discovery/zen/unicast]
Caused by: java.io.EOFException
at org.elasticsearch.common.io.stream.StreamInput.readBoolean(StreamInput.java:229)
at org.elasticsearch.discovery.zen.ping.ZenPing$PingResponse.readFrom(ZenPing.java:89)
at org.elasticsearch.discovery.zen.ping.ZenPing$PingResponse.readPingResponse(ZenPing.java:82)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$UnicastPingRequest.readFrom(UnicastZenPing.java:400)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:182)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:87)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:783)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:302)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:317)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:299)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:216)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:80)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:783)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:65)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:274)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:261)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:349)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.processSelectedKeys(NioWorker.java:280)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:200)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:44)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)
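The EOFException above looks like what you'd get when a 0.17.8 node answers a ping in the old wire format and the 0.18.7 reader expects additional trailing fields. A toy sketch of that failure mode (hypothetical field layouts, not the actual ZenPing serialization):

```python
import io
import struct

# Hypothetical "v1" ping payload: just a length-prefixed node name.
def write_ping_v1(node_name):
    data = node_name.encode("utf-8")
    return struct.pack(">H", len(data)) + data

# Hypothetical "v2" reader: expects the name *plus* a trailing boolean
# that v1 writers never send, so it runs off the end of the stream --
# the Python analogue of the java.io.EOFException in the trace above.
def read_ping_v2(stream):
    (length,) = struct.unpack(">H", stream.read(2))
    name = stream.read(length).decode("utf-8")
    flag = stream.read(1)
    if not flag:  # read(1) returns b"" at end of stream
        raise EOFError("stream ended before trailing boolean")
    return name, flag != b"\x00"

try:
    read_ping_v2(io.BytesIO(write_ping_v1("search-01")))
    err = None
except EOFError as e:
    err = e

print(err)  # stream ended before trailing boolean
```

The reverse direction produces the complementary symptom: an old reader that stops before consuming the new fields leaves unread bytes behind, which matches the "Message not fully read" warnings on the 0.17.8 nodes later in the logs.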
Then, as other upgraded nodes joined the cluster on 0.18.7, it started
discovering them:
[2012-01-11 10:05:53,898][INFO ][cluster.service ]
[search-01] new_master
[search-01][M3fXnCk3TxS6T-GPjfgYQg][inet[/10.147.174.140:9300]],
reason: zen-disco-join (elected_as_master)
[2012-01-11 10:05:53,918][INFO ][discovery ]
[search-01] example/M3fXnCk3TxS6T-GPjfgYQg
[2012-01-11 10:05:54,055][INFO ][http ]
[search-01] bound_address {inet[/0:0:0:0:0:0:0:0:9200]},
publish_address {inet[/10.147.174.140:9200]}
[2012-01-11 10:05:54,055][INFO ][node ]
[search-01] {0.18.7}[7135]: started
[2012-01-11 10:05:54,078][INFO ][gateway ]
[search-01] recovered [2] indices into cluster_state
[2012-01-11 10:05:54,080][INFO ][cluster.metadata ]
[search-01] updating number_of_replicas to [0] for indices
[search_engine_20120110_104503]
[2012-01-11 10:05:54,955][INFO ][cluster.metadata ]
[search-01] [search_engine_20120110_104503] auto expanded replicas to
[0]
[2012-01-11 10:05:54,957][INFO ][cluster.metadata ]
[search-01] updating number_of_replicas to [0] for indices
[search_engine_20120109_104503]
[2012-01-11 10:05:54,961][INFO ][cluster.metadata ]
[search-01] [search_engine_20120109_104503] auto expanded replicas to
[0]
[2012-01-11 10:05:54,962][INFO ][cluster.metadata ]
[search-01] updating number_of_replicas to [0] for indices
[search_engine_20120109_104503]
[2012-01-11 10:05:54,966][INFO ][cluster.metadata ]
[search-01] [search_engine_20120109_104503] auto expanded replicas to
[0]
[2012-01-11 10:06:08,316][INFO ][cluster.service ]
[search-01] added
{[search-02][exui0NkQQNyjngxQlyX4PQ][inet[/10.147.176.140:9300]],},
reason: zen-disco-receive(join from
node[[search-02][exui0NkQQNyjngxQlyX4PQ][inet[/10.147.176.140:9300]]])
[2012-01-11 10:06:08,320][INFO ][cluster.metadata ]
[search-01] updating number_of_replicas to [1] for indices
[search_engine_20120110_104503]
[2012-01-11 10:06:08,382][INFO ][cluster.metadata ]
[search-01] [search_engine_20120110_104503] auto expanded replicas to
[1]
[2012-01-11 10:06:08,382][INFO ][cluster.metadata ]
[search-01] updating number_of_replicas to [1] for indices
[search_engine_20120109_104503]
[2012-01-11 10:06:08,387][INFO ][cluster.metadata ]
[search-01] [search_engine_20120109_104503] auto expanded replicas to
[1]
[2012-01-11 10:06:08,388][INFO ][cluster.metadata ]
[search-01] updating number_of_replicas to [1] for indices
[search_engine_20120109_104503]
[2012-01-11 10:06:08,392][INFO ][cluster.metadata ]
[search-01] [search_engine_20120109_104503] auto expanded replicas to
[1]
[2012-01-11 10:07:44,837][INFO ][cluster.service ]
[search-01] added
{[search-03][xG_BCl-uTz69cwBslygwsQ][inet[/10.147.174.142:9300]],},
reason: zen-disco-receive(join from
node[[search-03][xG_BCl-uTz69cwBslygwsQ][inet[/10.147.174.142:9300]]])
[2012-01-11 10:07:44,839][INFO ][cluster.metadata ]
[search-01] updating number_of_replicas to [2] for indices
[search_engine_20120110_104503]
[2012-01-11 10:07:44,896][INFO ][cluster.metadata ]
[search-01] [search_engine_20120110_104503] auto expanded replicas to
[2]
[2012-01-11 10:07:44,897][INFO ][cluster.metadata ]
[search-01] updating number_of_replicas to [2] for indices
[search_engine_20120109_104503]
[2012-01-11 10:07:44,902][INFO ][cluster.metadata ]
[search-01] [search_engine_20120109_104503] auto expanded replicas to
[2]
[2012-01-11 10:07:44,902][INFO ][cluster.metadata ]
[search-01] updating number_of_replicas to [2] for indices
[search_engine_20120109_104503]
[2012-01-11 10:07:44,907][INFO ][cluster.metadata ]
[search-01] [search_engine_20120109_104503] auto expanded replicas to
[2]
[2012-01-11 10:07:45,651][DEBUG][action.admin.cluster.node.info]
[search-01] failed to execute on node [xG_BCl-uTz69cwBslygwsQ]
Meanwhile, the nodes still on 0.17.8 logged "Message not fully read"
warnings before they too went down for their 0.18.7 upgrade:
[2012-01-11 10:03:53,928][INFO ][cluster.service ]
[search-03] removed
{[search-01][VGeQKtjoT4axRLKI5K_EWQ][inet[/10.147.174.140:9300]],},
reason: zen-disco-receive(from master
[[search-02][LWPgo4rHRASq1nrCI2ZRAA][inet[/10.147.176.140:9300]]])
[2012-01-11 10:04:13,164][INFO ][discovery.zen ]
[search-03] master_left
[[search-02][LWPgo4rHRASq1nrCI2ZRAA][inet[/10.147.176.140:9300]]],
reason [shut_down]
[2012-01-11 10:04:13,180][INFO ][cluster.service ]
[search-03] master {new
[search-03][5wOHm0i8SmeVqL5JkRSaFQ][inet[/10.147.174.142:9300]],
previous [search-02][LWPgo4rHRASq1nrCI2ZRAA][inet[/10.147.176.140:9300]]},
removed {[search-02][LWPgo4rHRASq1nrCI2ZRAA][inet[/10.147.176.140:9300]],},
reason: zen-disco-master_failed
([search-02][LWPgo4rHRASq1nrCI2ZRAA][inet[/10.147.176.140:9300]])
[2012-01-11 10:04:13,197][INFO ][cluster.metadata ]
[search-03] updating number_of_replicas to [1] for indices
[search_engine_20120110_104503]
[2012-01-11 10:04:13,200][INFO ][cluster.metadata ]
[search-03] [search_engine_20120110_104503] auto expanded replicas to
[1]
[2012-01-11 10:04:13,200][INFO ][cluster.metadata ]
[search-03] updating number_of_replicas to [1] for indices
[search_engine_20120109_104503]
[2012-01-11 10:04:13,202][INFO ][cluster.metadata ]
[search-03] [search_engine_20120109_104503] auto expanded replicas to
[1]
[2012-01-11 10:04:13,202][INFO ][cluster.metadata ]
[search-03] updating number_of_replicas to [1] for indices
[search_engine_20120110_104503]
[2012-01-11 10:04:13,204][INFO ][cluster.metadata ]
[search-03] [search_engine_20120110_104503] auto expanded replicas to
[1]
[2012-01-11 10:04:13,204][INFO ][cluster.metadata ]
[search-03] updating number_of_replicas to [1] for indices
[search_engine_20120109_104503]
[2012-01-11 10:04:13,205][INFO ][cluster.metadata ]
[search-03] [search_engine_20120109_104503] auto expanded replicas to
[1]
[2012-01-11 10:04:13,205][INFO ][cluster.metadata ]
[search-03] updating number_of_replicas to [1] for indices
[search_engine_20120109_104503]
[2012-01-11 10:04:13,207][INFO ][cluster.metadata ]
[search-03] [search_engine_20120109_104503] auto expanded replicas to
[1]
[2012-01-11 10:05:50,946][WARN ][transport.netty ]
[search-03] Message not fully read (request) for [0] and action
[discovery/zen/unicast], resetting
[2012-01-11 10:05:52,380][WARN ][transport.netty ]
[search-03] Message not fully read (request) for [4] and action
[discovery/zen/unicast], resetting
[2012-01-11 10:05:53,884][WARN ][transport.netty ]
[search-03] Message not fully read (request) for [7] and action
[discovery/zen/unicast], resetting
[2012-01-11 10:06:05,333][WARN ][transport.netty ]
[search-03] Message not fully read (request) for [1] and action
[discovery/zen/unicast], resetting
[2012-01-11 10:06:06,783][WARN ][transport.netty ]
[search-03] Message not fully read (request) for [6] and action
[discovery/zen/unicast], resetting
[2012-01-11 10:06:08,288][WARN ][transport.netty ]
[search-03] Message not fully read (request) for [10] and action
[discovery/zen/unicast], resetting
[2012-01-11 10:06:48,924][INFO ][node ]
[search-03] {elasticsearch/0.17.8}[31667]: stopping ...
[2012-01-11 10:06:49,174][INFO ][node ]
[search-03] {elasticsearch/0.17.8}[31667]: stopped
[2012-01-11 10:06:49,175][INFO ][node ]
[search-03] {elasticsearch/0.17.8}[31667]: closing ...
[2012-01-11 10:06:49,587][INFO ][node ]
[search-03] {elasticsearch/0.17.8}[31667]: closed
[2012-01-11 10:07:40,405][INFO ][node ]
[search-03] {0.18.7}[31045]: initializing ...
[2012-01-11 10:07:40,413][INFO ][plugins ]
[search-03] loaded [], sites [bigdesk, head]
[2012-01-11 10:07:41,712][INFO ][node ]
[search-03] {0.18.7}[31045]: initialized
[2012-01-11 10:07:41,713][INFO ][node ]
[search-03] {0.18.7}[31045]: starting ...
[2012-01-11 10:07:41,769][INFO ][transport ]
[search-03] bound_address {inet[/0:0:0:0:0:0:0:0:9300]},
publish_address {inet[/10.147.174.142:9300]}
Anyway, I'll see if it happens again when we upgrade next. It didn't
happen with upgrades within the 0.17.* series, but due to the way our
cluster is set up (we aren't populating data all the time), having a
split brain for a bit is not that big of a deal.