Hi all.
I have a 13-node cluster running Elasticsearch 1.6.0, split as follows (rough role config sketched after the list):
3 dedicated masters
4 dedicated clients
6 dedicated data nodes
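For context, the roles are set per node in elasticsearch.yml roughly like this. This is a sketch from memory of the usual 1.x role settings, not copied from my actual configs, so treat it as approximate:

# dedicated master nodes (3): eligible as master, hold no data
node.master: true
node.data: false

# dedicated client nodes (4): no master eligibility, no data, just routing/coordination
node.master: false
node.data: false

# dedicated data nodes (6): hold data, never become master
node.master: false
node.data: true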
All was well until one of the data nodes logged the following exception and disconnected itself from the cluster:
[2015-07-09 09:53:06,953][WARN ][transport.netty ] [elasticsearch-bdprodes08] exception caught on transport layer [[id: 0x6b716852, /10.200.116.249:60911 :> /10.200.116.248:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format, got (3,41,4d,52)
at org.elasticsearch.transport.netty.SizeHeaderFrameDecoder.decode(SizeHeaderFrameDecoder.java:63)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:425)
at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:74)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-07-09 09:53:07,293][INFO ][discovery.zen ] [elasticsearch-bdprodes08] master_left [[elasticsearch-bdprodes03-2][AAxFokkhSBqOTojbhGO-EQ][bdprodes03][inet[/10.200.116.70:9301]]{data=false, master=true}], reason [do not exists on master, act as master failure]
[2015-07-09 09:53:07,413][WARN ][discovery.zen ] [elasticsearch-bdprodes08] master left (reason = do not exists on master, act as master failure), current nodes: {[elasticsearch-bdprodes01-2][D9MwysnEQXWKr5scrYLNpA][bdprodes01][inet[/10.200.116.68:9301]]{data=false, master=true},[elasticsearch-bdprodes05][DvJiqAM9TE-CVRL6v3YArw][bdprodes05][inet[/10.200.116.72:9300]]{master=false},[elasticsearch-bdprodes09][VoCSUvcRQFKW65EgV4bBYQ][bdprodes09][inet[/10.200.116.249:9300]]{master=false},[elasticsearch-bdprodes06][yyJn5RjZQpeg5hIf0e_4QA][bdprodes06][inet[/10.200.116.73:9300]]{master=false},[elasticsearch-bdprodes02-2][G9tH1fyITSqVW8lJ9vccVw][bdprodes02][inet[/10.200.116.69:9301]]{data=false, master=true},[elasticsearch-bdprodes02][sJM-puI8RSmTdQjIW85J4Q][bdprodes02][inet[/10.200.116.69:9300]]{data=false, master=false},[elasticsearch-bdprodes08][ZxaQ4iHJTO-vx5L8VqTbZA][bdprodes08][inet[bdprodes08.dbhotelcloud.com/10.200.116.248:9300]]{master=false},[elasticsearch-bdprodes04][_VGeVBHRR1ukhYkrZ1rVOQ][bdprodes04][inet[/10.200.116.71:9300]]{data=false, master=false},[elasticsearch-bdprodes10][BvTjSLARRmiMFdBkrIUdHQ][bdprodes10][inet[/10.200.116.250:9300]]{master=false},[elasticsearch-bdprodes01][nMPRh9BUSPaj_PWgk6LBoQ][bdprodes01][inet[/10.200.116.68:9300]]{data=false, master=false},[elasticsearch-bdprodes03][oQj7PUa8R5aw_2jcIxCbfA][bdprodes03][inet[/10.200.116.70:9300]]{data=false, master=false},[elasticsearch-bdprodes07][y6JfVvVpRjev4y6PEil9Eg][bdprodes07][inet[/10.200.116.247:9300]]{master=false},}
The other data nodes simply logged this:
[2015-07-09 05:34:18,757][INFO ][cluster.service ] [elasticsearch-bdprodes05] removed {[elasticsearch-bdprodes08][ZxaQ4iHJTO-vx5L8VqTbZA][bdprodes08][inet[/10.200.116.248:9300]]{master=false},}, reason: zen-disco-receive(from master [[elasticsearch-bdprodes03-2][AAxFokkhSBqOTojbhGO-EQ][bdprodes03][inet[/10.200.116.70:9301]]{data=false, master=true}])
And the master logged this:
[2015-07-09 05:34:18,598][INFO ][cluster.service ] [elasticsearch-bdprodes03-2] removed {[elasticsearch-bdprodes08][ZxaQ4iHJTO-vx5L8VqTbZA][bdprodes08][inet[bdprodes08.dbhotelcloud.com/10.200.116.248:9300]]{master=false},}, reason: zen-disco-node_failed([elasticsearch-bdprodes08][ZxaQ4iHJTO-vx5L8VqTbZA][bdprodes08][inet[bdprodes08.dbhotelcloud.com/10.200.116.248:9300]]{master=false}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
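The [3] tries and [30s] timeout in that line look like what I understand to be the 1.x zen fault-detection defaults, sketched here from memory (not taken from our config, so the exact keys/values are my assumption):

# zen fault detection, as I understand the 1.x defaults
discovery.zen.fd.ping_interval: 1s    # how often the master pings each node
discovery.zen.fd.ping_timeout: 30s    # the [30s] per-ping timeout from the log above
discovery.zen.fd.ping_retries: 3      # the [3] tries from the log above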
What the heck happened??
This one is new to me. I tried just bouncing the node that threw the exception, but that did not bring the cluster back to a healthy state, so I ended up doing a full cluster restart. Ouch.
Anyone seen this before?
Many thanks!
Chris