Unable to join the cluster


(Filirom1) #1

Hi,

We have a cluster of 3 servers (11s, 12s, 13s) that sits idle (no logs
for several days).

At 05:11, 11s and 12s removed 13s because of a ping failure:
[2012-07-06 05:11:04,329][INFO ][cluster.service ] [tpsmdt12s] removed {[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B},}, reason: zen-disco-node_failed([tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}), reason failed to ping, tried [3] times, each with maximum [30s] timeout
[2012-07-06 05:11:04,335][INFO ][cluster.service ] [tpsmdt11s] removed {[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B},}, reason: zen-disco-receive(from master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}])

11s and 12s then started rebalancing their shards.

At 05:15, 13s logged a warning about a long GC pause of 5.4 minutes and
learned that it had been disconnected from 11s and 12s:

[2012-07-06 05:15:01,705][WARN ][monitor.jvm ] [tpsmdt13s] [gc][ParNew][311645][14014] duration [5.4m], collections [1]/[5.4m], total [5.4m]/[7.3m], memory [1.3gb]->[1.3gb]/[7.9gb]
[2012-07-06 05:15:01,715][INFO ][discovery.zen ] [tpsmdt13s] master_left [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], reason [do not exists on master, act as master failure]
[2012-07-06 05:15:01,717][INFO ][cluster.service ] [tpsmdt13s] master {new [tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}, previous [tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}}, removed {[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A},}, reason: zen-disco-master_failed ([tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A})
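
The two log excerpts above fit together arithmetically: the master gave up after 3 pings of at most 30s each, while the GC pause on 13s lasted 5.4 minutes. Here is a rough Python sketch of that comparison. It assumes the detection window is simply retries times timeout (the real window probably also includes the interval between pings), and the function names are hypothetical, not part of Elasticsearch.

```python
def parse_duration_seconds(text: str) -> float:
    """Convert a log duration such as '30s' or '5.4m' to seconds."""
    units = {"ms": 0.001, "s": 1.0, "m": 60.0, "h": 3600.0}
    # Check 'ms' before 's' so that '100ms' is not read as seconds.
    for suffix in ("ms", "s", "m", "h"):
        if text.endswith(suffix):
            return float(text[: -len(suffix)]) * units[suffix]
    raise ValueError(f"unrecognized duration: {text!r}")


def outlasts_fault_detection(pause: str, retries: int, timeout: str) -> bool:
    """True if a pause is longer than retries * timeout (simplified window)."""
    window = retries * parse_duration_seconds(timeout)
    return parse_duration_seconds(pause) > window


# Values taken from the log lines above: 3 tries of 30s each vs. a 5.4m pause.
print(outlasts_fault_detection("5.4m", 3, "30s"))  # True
```

Under that simplification the window is about 90 seconds and the pause about 324 seconds, so 13s could not have answered in time.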

Then I woke up, had some coffee, and discovered the disconnection.
When a server is disconnected from the cluster, we usually delete all of
its data and restart it.

But this time it didn't work: the cluster will not let 13s join, nor 14s
(a new node).

Here is the log that repeats indefinitely:

[2012-07-06 11:58:47,164][INFO ][discovery.zen ] [tpsmdt13s] failed to send join request to master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}], reason [org.elasticsearch.transport.RemoteTransportException: [tpsmdt13s][inet[/10.26.165.16:9027]][discovery/zen/join]; org.elasticsearch.ElasticSearchIllegalStateException: Node [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}] not master for join request from [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}]]
[2012-07-06 11:58:47,164][TRACE][discovery.zen ] [tpsmdt13s] detailed failed reason
org.elasticsearch.transport.RemoteTransportException: [tpsmdt13s][inet[/10.26.165.16:9027]][discovery/zen/join]
Caused by: org.elasticsearch.ElasticSearchIllegalStateException: Node [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}] not master for join request from [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}]
at org.elasticsearch.discovery.zen.ZenDiscovery.handleJoinRequest(ZenDiscovery.java:555)
at org.elasticsearch.discovery.zen.ZenDiscovery.access$1900(ZenDiscovery.java:75)
at org.elasticsearch.discovery.zen.ZenDiscovery$MembershipListener.onJoin(ZenDiscovery.java:704)
at org.elasticsearch.discovery.zen.membership.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:152)
at org.elasticsearch.discovery.zen.membership.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:141)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:390)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-07-06 11:58:47,165][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] connecting (light) to [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:47,165][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] connecting (light) to [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:47,166][DEBUG][transport.netty ] [tpsmdt13s] connected to node [[#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]]
[2012-07-06 11:58:47,166][DEBUG][transport.netty ] [tpsmdt13s] connected to node [[#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]]
[2012-07-06 11:58:47,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] connected to [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:47,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] connected to [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:47,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:47,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:47,167][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}], cluster_name[search-bench]}]
[2012-07-06 11:58:47,167][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], cluster_name[search-bench]}]
[2012-07-06 11:58:48,665][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:48,666][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:48,666][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}], cluster_name[search-bench]}]
[2012-07-06 11:58:48,667][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], cluster_name[search-bench]}]
[2012-07-06 11:58:50,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:50,166][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] sending to [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:50,167][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}], cluster_name[search-bench]}]
[2012-07-06 11:58:50,167][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] received response from [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]: [ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}], master [null], cluster_name[search-bench]}, ping_response{target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], cluster_name[search-bench]}]
[2012-07-06 11:58:50,168][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] disconnecting from [#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]
[2012-07-06 11:58:50,168][TRACE][discovery.zen ] [tpsmdt13s] full ping responses:
--> target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}]
--> target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}]
[2012-07-06 11:58:50,169][DEBUG][discovery.zen ] [tpsmdt13s] filtered ping responses: (filter_client[true], filter_data[false])
--> target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}]
--> target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}]
[2012-07-06 11:58:50,169][DEBUG][transport.netty ] [tpsmdt13s] disconnected from [[#zen_unicast_1#][inet[tpsmdt11s.priv.atos.fr/10.26.165.14:9027]]]
[2012-07-06 11:58:50,169][TRACE][discovery.zen.ping.unicast] [tpsmdt13s] [15] disconnecting from [#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]
[2012-07-06 11:58:50,170][DEBUG][transport.netty ] [tpsmdt13s] disconnected from [[#zen_unicast_2#][inet[tpsmdt12s.priv.atos.fr/10.26.165.15:9027]]]
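
In the first line of that log, the master named in the join request and the node sending the request share the address 10.26.165.16:9027 but carry different node IDs. That suggests 13s is being pointed back at its own pre-restart identity. Here is a minimal Python sketch of the comparison. It assumes the descriptor layout `[name][id][inet[host/ip:port]]{attrs}` seen in these logs; the helper names are hypothetical and not an Elasticsearch API.

```python
import re

# Layout assumed from the log lines above: [name][id][inet[host/ip:port]]{attrs}
NODE_PATTERN = re.compile(
    r"\[(?P<name>[^\]]+)\]\[(?P<id>[^\]]+)\]\[inet\[(?P<host>[^/\]]*)/(?P<addr>[^\]]+)\]\]"
)


def parse_node(descriptor: str) -> dict:
    """Extract name, node ID and transport address from a node descriptor."""
    match = NODE_PATTERN.search(descriptor)
    if match is None:
        raise ValueError(f"not a node descriptor: {descriptor!r}")
    return match.groupdict()


def same_address_different_id(a: str, b: str) -> bool:
    """True when two descriptors point at one address under two node IDs."""
    node_a, node_b = parse_node(a), parse_node(b)
    return node_a["addr"] == node_b["addr"] and node_a["id"] != node_b["id"]


# Descriptors copied from the join failure above.
master = "[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}"
joiner = "[tpsmdt13s][P2t_n0U-R2GCBXTvCLBWAg][inet[/10.26.165.16:9027]]{zone=A}"
print(same_address_different_id(master, joiner))  # True
```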


(Igor Motov) #2

What you have is a typical split-brain situation: the node tpsmdt13s
got temporarily disconnected from the rest of the cluster and elected
itself as master, while tpsmdt12s remained the master for the rest of the
cluster (tpsmdt11s and itself). Restarting the rogue node (tpsmdt13s)
typically resolves the situation. If that doesn't work, a full cluster
restart is in order. Starting a new node (tpsmdt14s) in such a situation is
a bad idea, since the new node will get mixed messages about the master
node from both the real and the rogue master.
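
The "full ping responses" lines in the first post already show the symptom: two nodes answer with two different masters. Here is a small Python sketch that flags this disagreement. It assumes the `--> target [[...]], master [[...]]` line layout from that TRACE output; the function names are hypothetical.

```python
import re

# Captures the answering node's name and the ID of the master it reports.
PING_PATTERN = re.compile(
    r"target \[\[(?P<target>[^\]]+)\].*?"
    r"master \[\[(?P<master_name>[^\]]+)\]\[(?P<master_id>[^\]]+)\]"
)


def reported_masters(lines):
    """Map each answering node name to the master node ID it reports."""
    masters = {}
    for line in lines:
        match = PING_PATTERN.search(line)
        if match:  # lines with 'master [null]' do not match and are skipped
            masters[match.group("target")] = match.group("master_id")
    return masters


def looks_like_split_brain(lines) -> bool:
    """True when the answering nodes disagree about who the master is."""
    return len(set(reported_masters(lines).values())) > 1


# Lines copied from the 'full ping responses' output in the first post.
ping_lines = [
    "--> target [[tpsmdt11s][aMJbFd-eRK6jKxlow3dSCw][inet[/10.26.165.14:9027]]{zone=A}], master [[tpsmdt13s][7rfILKuXSLegJIT2bU9bsw][inet[/10.26.165.16:9027]]{zone=B}]",
    "--> target [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}], master [[tpsmdt12s][Hj_6OzgTT4ONz5vtw8orbA][inet[/10.26.165.15:9027]]{zone=A}]",
]
print(looks_like_split_brain(ping_lines))  # True
```

Comparing master IDs rather than names matters here, because a restarted node keeps its name but gets a new ID.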



(Filirom1) #3

Hi Igor, and thank you very much for your answer.

What you describe is exactly what I saw in my cluster.

Is split-brain something that happens frequently?
Or perhaps a better question: should I be prepared for full cluster
restarts in production? And how often (maybe difficult to answer)?
By default we disable the gateway because we are able to reindex
everything. But if cluster restarts are frequent, we should change this
configuration.

The strange thing about the split-brain is that 13s was configured with
minimum_master_nodes: 2.

discovery:
  zen:
    ping:
      multicast.enabled: false
      unicast.hosts: ["11s:9027", "12s:9027"]
    minimum_master_nodes: 2
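
For reference, a value of 2 matches the usual majority rule for three master-eligible nodes, (N / 2) + 1 with integer division. Under that rule, a node that can see only itself should not be able to elect a master. Here is a minimal Python sketch of the arithmetic. It assumes all three nodes are master-eligible, and the function names are hypothetical, not Elasticsearch code.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Majority of the master-eligible nodes: floor(N / 2) + 1."""
    return master_eligible_nodes // 2 + 1


def may_elect_master(visible_master_eligible: int, minimum_master_nodes: int) -> bool:
    """An election should only proceed when enough eligible nodes are visible."""
    return visible_master_eligible >= minimum_master_nodes


print(quorum(3))               # 2
print(may_elect_master(1, 2))  # False: an isolated node sees only itself
print(may_elect_master(2, 2))  # True: 11s and 12s together form a majority
```

This is why 13s electing itself with this setting in place looks unexpected.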

Thank you

Romain



(Igor Motov) #4

Split brain usually happens when a master node gets overloaded and stops
responding to fault detection pings long enough for other nodes to decide
that this node is dead. There are two basic remedies to this issue:
increase the fault detection timeout (for example, by increasing the number
of ping_retries, see
http://www.elasticsearch.org/guide/reference/modules/discovery/zen.html)
and make sure that your nodes are not getting too overloaded.
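
To put rough numbers on the first remedy, here is a small sketch that estimates how long a node can stay silent before it is dropped. It takes the worst case as ping_retries × ping_timeout. The 3 tries and the 30s timeout come from the log earlier in this thread ("tried [3] times, each with maximum [30s] timeout"). The function name is made up for the example, and the estimate ignores the interval between pings, so treat it as an approximation only.

```python
def detection_window_s(ping_retries: int, ping_timeout_s: float) -> float:
    """Rough worst-case silence (in seconds) tolerated before a node is dropped.

    Assumption: every retry waits for the full timeout; the interval
    between pings is ignored.
    """
    return ping_retries * ping_timeout_s


# Values reported in the log: 3 tries, 30s timeout each.
window = detection_window_s(3, 30)

# GC pause reported by monitor.jvm on tpsmdt13s: 5.4 minutes.
gc_pause_s = 5.4 * 60

print(f"tolerated silence: {window:.0f}s, GC pause: {gc_pause_s:.0f}s")
print("node dropped" if gc_pause_s > window else "node kept")
```

Under these assumptions the pause is several times longer than the tolerated silence, which would explain why 11s and 12s gave up on 13s.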



(Filirom1) #5

OK, thank you. I will set ping_timeout: 60s and ping_retries: 6 so that the
cluster can ride out a GC pause of about 5 minutes.
Another option would be to kill ES with a SIGKILL if it stops responding
after ping_retries attempts, so that the old master cannot coexist with the
new master.

But I am still wondering why this GC blocked ES for 5 minutes.
It's just crazy, because nothing changed in the Young Generation.
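
As a sanity check on those two values, here is a minimal sketch that works out how many retries are needed to cover a pause of a given length. It assumes the tolerated silence is roughly ping_retries × ping_timeout (each retry waiting for the full timeout, ping interval ignored). The helper name is hypothetical.

```python
import math


def retries_needed(pause_s: float, ping_timeout_s: float) -> int:
    """Smallest retry count whose worst-case window covers the pause."""
    return math.ceil(pause_s / ping_timeout_s)


gc_pause_s = 324  # the 5.4 minute pause from the log, in seconds

# With the proposed 60s timeout:
print(retries_needed(gc_pause_s, 60))  # -> 6
# With the 30s timeout seen in the log:
print(retries_needed(gc_pause_s, 30))  # -> 11
```

With a 60s timeout, 6 retries give about 360s, which leaves only a small margin over the 324s pause seen here.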

I monitored the JVM GC jstat -gc:

Time     S0C    S1C    S0U    S1U   EC      EU      OC        OU        PC      PU      YGC   YGCT    FGC FGCT  GCT
05:08:02 2112.0 2112.0 74.0   0.0   17024.0 6306.3  2094212.0 1431738.8 60736.0 36564.6 14010 114.527 16  0.048 114.575
05:09:01 2112.0 2112.0 106.2  0.0   17024.0 12290.5 2094212.0 1431795.2 60736.0 36564.6 14012 114.535 16  0.048 114.584
05:10:01 2112.0 2112.0 92.9   125.7 17024.0 17024.0 2094212.0 1431961.9 60736.0 36564.6 14014 114.539 16  0.048 114.588
05:11:02 2112.0 2112.0 92.9   125.7 17024.0 17024.0 2094212.0 1431961.9 60736.0 36564.6 14014 114.539 16  0.048 114.588
05:12:01 2112.0 2112.0 92.9   125.7 17024.0 17024.0 2094212.0 1431961.9 60736.0 36564.6 14014 114.539 16  0.048 114.588
05:13:02 2112.0 2112.0 92.9   125.7 17024.0 17024.0 2094212.0 1431961.9 60736.0 36564.6 14014 114.539 16  0.048 114.588
05:14:01 2112.0 2112.0 92.9   125.7 17024.0 17024.0 2094212.0 1431961.9 60736.0 36564.6 14014 114.539 16  0.048 114.588
05:15:01 2112.0 2112.0 92.9   0.0   17024.0 6787.5  2094212.0 1431851.0 60736.0 36677.2 14014 442.206 16  0.048 442.254
05:16:02 2112.0 2112.0 1055.1 0.0   17024.0 396.8   2094212.0 1433308.8 60736.0 36686.7 14016 442.226 16  0.048 442.274
05:17:01 2112.0 2112.0 0.0    258.4 17024.0 10533.0 2094212.0 1433310.6 60736.0 36686.7 14017 442.231 16  0.048 442.279

YGCT (young generation garbage collection time) increased from 114s to 442s,
but young generation utilization did not change.
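
To check that this jump matches the pause reported in the ES log, here is a small sketch using only the YGC and YGCT values of the 05:14:01 and 05:15:01 jstat samples (the helper name is made up for illustration):

```python
# (time, YGC, YGCT in seconds) copied from the jstat samples
samples = [
    ("05:14:01", 14014, 114.539),
    ("05:15:01", 14014, 442.206),
]

def ygct_jump(before, after):
    """Return (extra collections, extra young-GC seconds) between two samples."""
    return after[1] - before[1], after[2] - before[2]

collections, seconds = ygct_jump(samples[0], samples[1])
print(f"{collections} new collections, {seconds:.1f}s (~{seconds / 60:.2f} min)")
```

The difference is about 327.7s, roughly 5.46 minutes, with no new collection counted. That is in line with the [5.4m] duration in the monitor.jvm warning and suggests that the single collection numbered 14014 accounts for the whole jump.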

Crazy thing.

2012/7/9 Igor Motov imotov@gmail.com

Split brain usually happens when a master node gets overloaded and stops
responding to fault detection pings long enough for other nodes to decide
that this node is dead. There are two basic remedies to this issue:
increase fault detection timeout (by increasing number of ping_retries
(http://www.elasticsearch.org/guide/reference/modules/discovery/zen.html),
for example) and make sure that your nodes are not getting too overloaded.

On Monday, July 9, 2012 10:23:55 AM UTC-4, Filirom1 wrote:

Hi Igor, and thank you very much for your answer.

What you describe is exactly what I saw in my cluster.

Is split-brain something that happens frequently?
Or perhaps a better question is: should I be prepared for a full cluster
restart in production? And how often (maybe difficult to answer)?
By default we disable the gateway because we are able to reindex
everything. But if cluster restarts are frequent, we should change this
configuration.

The strange thing about the split-brain is that 13s was configured with
minimum_master_nodes: 2.

discovery:
  zen:
    ping:
      multicast.enabled: false
      unicast.hosts: ["11s:9027", "12s:9027"]
    minimum_master_nodes: 2
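
For reference, the usual rule of thumb for minimum_master_nodes is a strict majority of the master-eligible nodes. A minimal sketch of that rule, assuming all 3 nodes (11s, 12s, 13s) are master-eligible; it only models the intended behaviour, not what Elasticsearch actually did here, and the function names are made up for illustration:

```python
def quorum(master_eligible_nodes):
    """Strict majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1

def may_elect_master(visible_master_nodes, minimum_master_nodes):
    """Intended rule: an election needs at least minimum_master_nodes visible."""
    return visible_master_nodes >= minimum_master_nodes

setting = quorum(3)
alone = may_elect_master(1, setting)     # 13s on its own
majority = may_elect_master(2, setting)  # 11s + 12s together
print(setting, alone, majority)
```

With 3 nodes the majority is 2, so under this rule a node that only sees itself should not be able to elect itself master, which is why the behaviour of 13s looks strange.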

Thank you

Romain

2012/7/8 Igor Motov imotov@gmail.com

What you have got is a typical split brain situation: the node tpsmdt13s
got temporarily disconnected from the rest of the cluster and elected
itself as a master, while tpsmdt12s remained the master for the rest of the
cluster (tpsmdt11s and itself). Typically restarting the rogue node
(tpsmdt13s) helps the situation. If this doesn't work a full cluster
restart is in order. Starting a new node (tpsmdt14s) in such a situation is
a bad idea since the new node will be getting mixed messages about the
master node from both the real and the rogue master nodes.



(system) #6