Any ideas what could be causing this? If I start my Elasticsearch
nodes within a few minutes of each other, they discover each other correctly
and everything is fine.
If they have been connected for a while (an hour or so) and I then shut down
one node and start it again, it comes back up split-brained. Here are some logs with
discovery logging set to TRACE.
Starting the nodes within a few minutes of each other, they find each other and the
logstash client nodes get in on the action as well.
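For reference, zen discovery's guard against exactly this failure mode is discovery.zen.minimum_master_nodes (the logs below show it at [-1] and [1], both of which let a single node elect itself). A hypothetical elasticsearch.yml fragment for a cluster with two master-eligible data nodes; these values are assumptions for illustration, not taken from the actual config:

```yaml
# Hypothetical elasticsearch.yml fragment -- not the actual config in use.
# Requiring 2 master-eligible nodes for an election stops a lone restarted
# node from electing itself master and splitting the cluster.
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.timeout: 3s
```

The usual caveat applies: with only two master-eligible nodes, a setting of 2 means the cluster cannot elect a master while either node is down.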
Node1:
Waiting for ElasticSearch......
running: PID:3042
[2012-11-28 17:46:36,504][INFO ][plugins ] [Squirrel Girl]
loaded [river-rabbitmq], sites [head]
[2012-11-28 17:46:37,235][DEBUG][discovery.zen.ping.multicast] [Squirrel
Girl] using group [224.2.2.4], with port [54328], ttl [3], and address
[null]
[2012-11-28 17:46:37,237][DEBUG][discovery.zen.ping.unicast] [Squirrel
Girl] using initial hosts [], with concurrent_connects [10]
[2012-11-28 17:46:37,237][DEBUG][discovery.zen ] [Squirrel Girl]
using ping.timeout [3s], master_election.filter_client [true],
master_election.filter_data [false]
[2012-11-28 17:46:37,241][DEBUG][discovery.zen.elect ] [Squirrel Girl]
using minimum_master_nodes [-1]
[2012-11-28 17:46:37,242][DEBUG][discovery.zen.fd ] [Squirrel Girl]
[master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2012-11-28 17:46:37,245][DEBUG][discovery.zen.fd ] [Squirrel Girl]
[node ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2012-11-28 17:46:38,585][INFO ][node ] [Squirrel Girl]
{0.19.8}[3044]: initialized
[2012-11-28 17:46:38,585][INFO ][node ] [Squirrel Girl]
{0.19.8}[3044]: starting ...
[2012-11-28 17:46:38,637][INFO ][transport ] [Squirrel Girl]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address
{inet[/10.72.228.41:9300]}
[2012-11-28 17:46:41,668][DEBUG][discovery.zen ] [Squirrel Girl]
filtered ping responses: (filter_client[true], filter_data[false]) {none}
[2012-11-28 17:46:41,672][INFO ][cluster.service ] [Squirrel Girl]
new_master [Squirrel
Girl][kVe32Tu_SQW_AL8U-1ICDg][inet[/10.72.228.41:9300]]{rack=rack1},
reason: zen-disco-join (elected_as_master)
[2012-11-28 17:46:41,686][INFO ][discovery ] [Squirrel Girl]
elasticsearch/kVe32Tu_SQW_AL8U-1ICDg
[2012-11-28 17:46:41,712][INFO ][http ] [Squirrel Girl]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address
{inet[/10.72.228.41:9200]}
[2012-11-28 17:46:41,712][INFO ][node ] [Squirrel Girl]
{0.19.8}[3044]: started
[2012-11-28 17:46:42,791][INFO ][gateway ] [Squirrel Girl]
recovered [3] indices into cluster_state
[2012-11-28 17:46:45,472][INFO ][cluster.service ] [Squirrel Girl]
added {[Blazing
Skull][SQANGh9lTcOPeb1vao8KdA][inet[/10.72.228.43:9300]]{client=true,
data=false},}, reason: zen-disco-receive(join from node[[Blazing
Skull][SQANGh9lTcOPeb1vao8KdA][inet[/10.72.228.43:9300]]{client=true,
data=false}])
[2012-11-28 17:46:45,539][INFO ][cluster.service ] [Squirrel Girl]
added
{[Hellcat][iW1rJzyIQLCKEMXwALQRFw][inet[/10.72.228.43:9301]]{client=true,
data=false},}, reason: zen-disco-receive(join from
node[[Hellcat][iW1rJzyIQLCKEMXwALQRFw][inet[/10.72.228.43:9301]]{client=true,
data=false}])
[2012-11-28 17:47:20,656][INFO ][cluster.service ] [Squirrel Girl]
added
{[Ape][tjrFnEoKS6Ch2VdjbxBC1A][inet[/10.72.228.42:9300]]{rack=rack2},},
reason: zen-disco-receive(join from
node[[Ape][tjrFnEoKS6Ch2VdjbxBC1A][inet[/10.72.228.42:9300]]{rack=rack2}])
Failure scenario: after an hour or so of normal operation I shut down one of the
nodes. The other node sees the shutdown and promotes itself to master.
Node1:
service elasticsearch stop
Stopping ElasticSearch...
Stopped ElasticSearch.
Node2:
[2012-11-28 21:08:38,692][INFO ][discovery.zen ] [Tutinax the
Mountain-Mover] master_left
[[Silhouette][sgzPM72TRv6dI_paYIo4lg][inet[/10.72.228.42:9300]]{rack=rack2}],
reason [shut_down]
[2012-11-28 21:08:38,694][DEBUG][discovery.zen.fd ] [Tutinax the
Mountain-Mover] [master] stopping fault detection against master
[[Silhouette][sgzPM72TRv6dI_paYIo4lg][inet[/10.72.228.42:9300]]{rack=rack2}],
reason [got elected as new master since master left (reason = shut_down)]
[2012-11-28 21:08:38,695][INFO ][cluster.service ] [Tutinax the
Mountain-Mover] master {new [Tutinax the
Mountain-Mover][yO8eFKAHTl-TPJwIyw99WQ][inet[/10.72.228.41:9300]]{rack=rack1},
previous
[Silhouette][sgzPM72TRv6dI_paYIo4lg][inet[/10.72.228.42:9300]]{rack=rack2}},
removed
{[Silhouette][sgzPM72TRv6dI_paYIo4lg][inet[/10.72.228.42:9300]]{rack=rack2},},
reason: zen-disco-master_failed
([Silhouette][sgzPM72TRv6dI_paYIo4lg][inet[/10.72.228.42:9300]]{rack=rack2})
I then start the node I shut down back up. It comes up and
gets ping responses from the client nodes (logstash), but nothing from the
master node, so the two data nodes end up split-brained.
Node1:
service elasticsearch start; tail -f
/opt/elasticsearch/logs/elasticsearch.log
Starting ElasticSearch...
Waiting for ElasticSearch......
running: PID:22501
[2012-11-28 21:09:21,861][INFO ][node ] [Sugar Man]
{0.19.8}[22503]: initializing ...
[2012-11-28 21:09:21,868][INFO ][plugins ] [Sugar Man]
loaded [river-rabbitmq], sites [head]
[2012-11-28 21:09:22,589][DEBUG][discovery.zen.ping.multicast] [Sugar Man]
using group [224.2.2.4], with port [54328], ttl [3], and address
[null]
[2012-11-28 21:09:22,591][DEBUG][discovery.zen.ping.unicast] [Sugar Man]
using initial hosts [], with concurrent_connects [10]
[2012-11-28 21:09:22,592][DEBUG][discovery.zen ] [Sugar Man]
using ping.timeout [10s], master_election.filter_client [true],
master_election.filter_data [false]
[2012-11-28 21:09:22,596][DEBUG][discovery.zen.elect ] [Sugar Man]
using minimum_master_nodes [1]
[2012-11-28 21:09:22,597][DEBUG][discovery.zen.fd ] [Sugar Man]
[master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2012-11-28 21:09:22,599][DEBUG][discovery.zen.fd ] [Sugar Man]
[node ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
[2012-11-28 21:09:23,885][INFO ][node ] [Sugar Man]
{0.19.8}[22503]: initialized
[2012-11-28 21:09:23,886][INFO ][node ] [Sugar Man]
{0.19.8}[22503]: starting ...
[2012-11-28 21:09:23,938][INFO ][transport ] [Sugar Man]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address
{inet[/10.72.228.42:9300]}
[2012-11-28 21:09:23,952][TRACE][discovery ] [Sugar Man]
waiting for 30s for the initial state to be set by the discovery
[2012-11-28 21:09:23,960][TRACE][discovery.zen.ping.multicast] [Sugar Man]
[1] sending ping request
[2012-11-28 21:09:23,994][TRACE][discovery.zen.ping.multicast] [Sugar Man]
[1] received ping_response{target [[Blazing
Skull][SQANGh9lTcOPeb1vao8KdA][inet[/10.72.228.43:9300]]{client=true,
data=false}], master [[Tutinax the
Mountain-Mover][yO8eFKAHTl-TPJwIyw99WQ][inet[/10.72.228.41:9300]]{rack=rack1}],
cluster_name[elasticsearch]}
[2012-11-28 21:09:23,997][TRACE][discovery.zen.ping.multicast] [Sugar Man]
[1] received ping_response{target
[[Hellcat][iW1rJzyIQLCKEMXwALQRFw][inet[/10.72.228.43:9301]]{client=true,
data=false}], master [[Tutinax the
Mountain-Mover][yO8eFKAHTl-TPJwIyw99WQ][inet[/10.72.228.41:9300]]{rack=rack1}],
cluster_name[elasticsearch]}
[2012-11-28 21:09:28,996][TRACE][discovery.zen.ping.multicast] [Sugar Man]
[1] sending ping request
[2012-11-28 21:09:28,998][TRACE][discovery.zen.ping.multicast] [Sugar Man]
[1] received ping_response{target
[[Hellcat][iW1rJzyIQLCKEMXwALQRFw][inet[/10.72.228.43:9301]]{client=true,
data=false}], master [[Tutinax the
Mountain-Mover][yO8eFKAHTl-TPJwIyw99WQ][inet[/10.72.228.41:9300]]{rack=rack1}],
cluster_name[elasticsearch]}
[2012-11-28 21:09:28,998][TRACE][discovery.zen.ping.multicast] [Sugar Man]
[1] received ping_response{target [[Blazing
Skull][SQANGh9lTcOPeb1vao8KdA][inet[/10.72.228.43:9300]]{client=true,
data=false}], master [[Tutinax the
Mountain-Mover][yO8eFKAHTl-TPJwIyw99WQ][inet[/10.72.228.41:9300]]{rack=rack1}],
cluster_name[elasticsearch]}
[2012-11-28 21:09:34,000][TRACE][discovery.zen ] [Sugar Man]
full ping responses:
--> target
[[Hellcat][iW1rJzyIQLCKEMXwALQRFw][inet[/10.72.228.43:9301]]{client=true,
data=false}], master [[Tutinax the
Mountain-Mover][yO8eFKAHTl-TPJwIyw99WQ][inet[/10.72.228.41:9300]]{rack=rack1}]
--> target [[Blazing
Skull][SQANGh9lTcOPeb1vao8KdA][inet[/10.72.228.43:9300]]{client=true,
data=false}], master [[Tutinax the
Mountain-Mover][yO8eFKAHTl-TPJwIyw99WQ][inet[/10.72.228.41:9300]]{rack=rack1}]
[2012-11-28 21:09:34,001][DEBUG][discovery.zen ] [Sugar Man]
filtered ping responses: (filter_client[true], filter_data[false]) {none}
[2012-11-28 21:09:34,012][INFO ][cluster.service ] [Sugar Man]
new_master [Sugar
Man][hEdqYzszT-2Lma14eoeoiA][inet[/10.72.228.42:9300]]{rack=rack2}, reason:
zen-disco-join (elected_as_master)
[2012-11-28 21:09:34,044][TRACE][discovery ] [Sugar Man]
initial state set from discovery
[2012-11-28 21:09:34,044][INFO ][discovery ] [Sugar Man]
elasticsearch/hEdqYzszT-2Lma14eoeoiA
[2012-11-28 21:09:34,079][INFO ][http ] [Sugar Man]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address
{inet[/10.72.228.42:9200]}
[2012-11-28 21:09:34,081][INFO ][node ] [Sugar Man]
{0.19.8}[22503]: started
[2012-11-28 21:09:35,129][INFO ][gateway ] [Sugar Man]
recovered [3] indices into cluster_state
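The split can be confirmed by fetching /_cluster/state from each node and comparing the master_node field. A minimal sketch of that check, using the node IDs from the trace output above (the function name and the sample dicts are illustrative, not real API objects):

```python
# Hypothetical helper: given the /_cluster/state JSON from each node,
# report whether they agree on the elected master.
def masters_agree(states):
    # Map each host to the master node ID it reports.
    masters = {host: state["master_node"] for host, state in states.items()}
    # A healthy cluster has exactly one master ID across all nodes.
    return len(set(masters.values())) == 1, masters

# Sample shapes mirroring the logs above (node IDs from the trace output):
node1 = {"master_node": "yO8eFKAHTl-TPJwIyw99WQ"}  # Tutinax the Mountain-Mover
node2 = {"master_node": "hEdqYzszT-2Lma14eoeoiA"}  # Sugar Man
agree, who = masters_agree({"10.72.228.41": node1, "10.72.228.42": node2})
print(agree)  # False: each node has elected a different master
```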