Hi,
We have a cluster of 3 servers (11s, 12s, 13s) that was idle (nothing in the
logs for several days).
At 05:11, 11s and 12s removed 13s because of a ping failure:
[2012-07-06 05:11:04,329][INFO ][cluster.service ] [tpsmdt12s] removed {[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B},}, reason: zen-disco-node_failed([tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2012-07-06 05:11:04,335][INFO ][cluster.service ] [tpsmdt11s] removed {[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B},}, reason: zen-disco-receive(from master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}])
11s and 12s then started rebalancing their shards.
At 05:15, 13s logged a warning about a long GC pause of 5.4m, and found out
that it had been disconnected from 11s and 12s:
[2012-07-06 05:15:01,705][WARN ][monitor.jvm ] [tpsmdt13s] [gc][ParNew][311645][14014] duration [5.4m], collections [1]/[5.4m], total [5.4m]/[7.3m], memory [1.3gb]->[1.3gb]/[7.9gb]
[2012-07-06 05:15:01,715][INFO ][discovery.zen ] [tpsmdt13s] master_left [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], reason [do not exists on master, act as master failure]
[2012-07-06 05:15:01,717][INFO ][cluster.service ] [tpsmdt13s] master {new [tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}, previous [tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}}, removed {[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A},}, reason: zen-disco-master_failed ([tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A})
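To check that the GC pause alone is enough to explain the removal, here is a rough back-of-the-envelope sketch in Python. It only uses numbers copied from the logs above. It assumes the three pings were sent back to back with no interval between them, and that the pause ended at the moment the GC warning was logged; both are simplifications, so treat the result as an estimate.

```python
from datetime import datetime, timedelta

# Values copied from the log lines above.
PING_RETRIES = 3       # "tried [3] times"
PING_TIMEOUT_S = 30    # "each with maximum [30s] timeout"
GC_PAUSE_S = 324       # "duration [5.4m]" = 5.4 * 60 seconds (rounded in the log)

# Time at which 13s logged the GC warning (milliseconds dropped).
gc_logged_at = datetime(2012, 7, 6, 5, 15, 1)

# Assumption: retries run back to back, so the master gives up after this long.
detection_window_s = PING_RETRIES * PING_TIMEOUT_S

# Assumption: the pause ended when the warning was written.
pause_started_at = gc_logged_at - timedelta(seconds=GC_PAUSE_S)
expected_removal_at = pause_started_at + timedelta(seconds=detection_window_s)

print(f"detection window : {detection_window_s}s")
print(f"GC pause         : {GC_PAUSE_S}s")
print(f"pause started    : {pause_started_at:%H:%M:%S}")
print(f"expected removal : {expected_removal_at:%H:%M:%S}")
print("pause longer than detection window:", GC_PAUSE_S > detection_window_s)
```

Under those assumptions the estimated removal time lands within a few seconds of the 05:11:04 removal logged by 12s, which is consistent with 13s simply not answering pings while it was stuck in GC.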
Then I woke up, had some coffee, and discovered the disconnection.
When a server gets disconnected from the cluster, we usually delete all of
its data and restart it.
But this time it didn't work: the cluster refuses to let 13s rejoin, and it
also refuses 14s (a new node).
Here is the log that keeps repeating indefinitely:
[2012-07-06 11:58:47,164][INFO ][discovery.zen ] [tpsmdt13s] failed to send join request to master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}], reason [org.elasticsearch.transport.RemoteTransportException: [tpsmdt13s][inet[/10.26.165.16:9027]][discovery/zen/join]; org.elasticsearch.ElasticSearchIllegalStateException: Node [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}] not master for join request from [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}]]
[2012-07-06 11:58:47,164][TRACE][discovery.zen ] [tpsmdt13s] detailed failed reason
org.elasticsearch.transport.RemoteTransportException: [tpsmdt13s][inet[/10.26.165.16:9027]][discovery/zen/join]
Caused by: org.elasticsearch.ElasticSearchIllegalStateException: Node [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}] not master for join request from [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}]
at org.elasticsearch.discovery.zen.ZenDiscovery.handleJoinRequest(ZenDiscovery.java:555)
at org.elasticsearch.discovery.zen.ZenDiscovery.access$1900(ZenDiscovery.java:75)
at org.elasticsearch.discovery.zen.ZenDiscovery$MembershipListener.onJoin(ZenDiscovery.java:704)
at org.elasticsearch.discovery.zen.membership.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:152)
at org.elasticsearch.discovery.zen.membership.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:141)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:390)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-07-06 11:58:47,165][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] connecting (light) to [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:47,165][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] connecting (light) to [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:47,166][DEBUG][transport.netty ] [tpsmdt13s] connected to node [[#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]]
[2012-07-06 11:58:47,166][DEBUG][transport.netty ] [tpsmdt13s] connected to node [[#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]]
[2012-07-06 11:58:47,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] connected to [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:47,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] connected to [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:47,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:47,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:47,167][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}], cluster_name[search-bench]}]
[2012-07-06 11:58:47,167][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], cluster_name[search-bench]}]
[2012-07-06 11:58:48,665][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:48,666][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:48,666][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}], cluster_name[search-bench]}]
[2012-07-06 11:58:48,667][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], cluster_name[search-bench]}]
[2012-07-06 11:58:50,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:50,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:50,167][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}], cluster_name[search-bench]}]
[2012-07-06 11:58:50,167][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], cluster_name[search-bench]}]
[2012-07-06 11:58:50,168][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] disconnecting from [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:50,168][TRACE][discovery.zen ] [tpsmdt13s] full ping responses:
--> target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}]
--> target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}]
[2012-07-06 11:58:50,169][DEBUG][discovery.zen ] [tpsmdt13s] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}]
--> target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}]
[2012-07-06 11:58:50,169][DEBUG][transport.netty ] [tpsmdt13s] disconnected from [[#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]]
[2012-07-06 11:58:50,169][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] disconnecting from [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:50,170][DEBUG][transport.netty ] [tpsmdt13s] disconnected from [[#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]]
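To make the "full ping responses" above easier to read, here is a small Python sketch that extracts which master each node reports. The two input lines are copied from the log; the regular expression, the function name `masters_by_target` and the constant `LOCAL_NODE_ID` are my own and only meant for this illustration. The interpretation in the comments is one possible reading of the logs, not a confirmed diagnosis.

```python
import re

# The two "full ping responses" lines, copied from the log above.
PING_LINES = """\
--> target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}]
--> target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}]
"""

# Node id that 13s reports for itself after the restart (from the join error).
LOCAL_NODE_ID = "P2t_n0U-R2GCBXTvCLBWAg"

# Captures the name and id of the responding node and of the master it reports.
PATTERN = re.compile(
    r"target \[\[(?P<tname>[^\]]+)\]\[(?P<tid>[^\]]+)\]"
    r".*?master \[\[(?P<mname>[^\]]+)\]\[(?P<mid>[^\]]+)\]"
)

def masters_by_target(text):
    """Map each responding node name to the (name, id) of the master it reports."""
    return {m["tname"]: (m["mname"], m["mid"]) for m in PATTERN.finditer(text)}

views = masters_by_target(PING_LINES)
for target, (name, node_id) in views.items():
    print(f"{target} reports master {name} ({node_id})")

# More than one distinct master means the nodes do not agree on who is master.
distinct_masters = set(views.values())
print("nodes disagree on master:", len(distinct_masters) > 1)

# Nodes whose reported master is named tpsmdt13s but carries an id other than
# the one 13s has now, i.e. they still point at the instance from before the restart.
stale = [t for t, (name, mid) in views.items()
         if name == "tpsmdt13s" and mid != LOCAL_NODE_ID]
print("nodes pointing at a stale 13s id:", stale)
```

Read this way, 12s considers itself master while 11s still reports the old 13s instance (id 7rfILK...) as master. That would fit the join error: 13s sends its join request to 10.26.165.16, which is now itself under a new id and is not a master.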