Strange split brain scenario

Hi,
We're using version 0.17.6 with two servers and had a strange problem. One
of the servers (es1-01) identified itself as the master and its peer
(es1-02) as the slave:
{
  "cluster_name" : "gcs",
  "master_node" : "j0VcAKNRSsKnbHyaeof6pQ",
  "blocks" : {
  },
  "nodes" : {
    "zi6VNqehTCaZjNS0nbGUhg" : {
      "name" : "es1-02",
      "transport_address" : "inet[/10.1.101.152:9300]",
      "attributes" : {
      }
    },
    "j0VcAKNRSsKnbHyaeof6pQ" : {
      "name" : "es1-01",
      "transport_address" : "inet[/10.1.101.151:9300]",
      "attributes" : {
      }
    }
  },

While the other one only saw itself:
{
  "cluster_name" : "gcs",
  "master_node" : "zi6VNqehTCaZjNS0nbGUhg",
  "blocks" : {
  },
  "nodes" : {
    "zi6VNqehTCaZjNS0nbGUhg" : {
      "name" : "es1-02",
      "transport_address" : "inet[/10.1.101.152:9300]",
      "attributes" : {
      }
    }
  },

Only resetting es1-02 caused it to properly identify the other server.
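
For anyone who wants to catch this condition automatically, here is a minimal sketch in Python. It compares the cluster state as reported by each node and flags any disagreement about the master or about cluster membership. The function name find_split_brain and the trimmed-down state dictionaries are illustrative assumptions, not part of Elasticsearch; in practice the dictionaries would be the parsed cluster state output fetched from each node separately.

```python
def find_split_brain(states):
    """Return a list of problems found when comparing per-node cluster views.

    `states` maps a label for the node that was queried to the cluster
    state dict that node returned (hypothetical input shape).
    """
    masters = {label: s.get("master_node") for label, s in states.items()}
    members = {label: frozenset(s.get("nodes", {})) for label, s in states.items()}

    problems = []
    if len(set(masters.values())) > 1:
        problems.append("nodes disagree on the master: %s" % sorted(masters.items()))
    if len(set(members.values())) > 1:
        problems.append("nodes disagree on cluster membership")
    return problems


# The two views from this post, reduced to the fields the check needs.
es1_01_view = {
    "master_node": "j0VcAKNRSsKnbHyaeof6pQ",
    "nodes": {
        "zi6VNqehTCaZjNS0nbGUhg": {"name": "es1-02"},
        "j0VcAKNRSsKnbHyaeof6pQ": {"name": "es1-01"},
    },
}
es1_02_view = {
    "master_node": "zi6VNqehTCaZjNS0nbGUhg",
    "nodes": {
        "zi6VNqehTCaZjNS0nbGUhg": {"name": "es1-02"},
    },
}

problems = find_split_brain({"es1-01": es1_01_view, "es1-02": es1_02_view})
for p in problems:
    print(p)
```

With the two views above, both checks fire: the nodes report different masters and different node lists. An empty result means every queried node agrees.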

Now es1-01 is the master of all the shards in all the indexes, and 3 days
after the problem the cluster still hasn't rebalanced. Is that expected? Is
there a way to force rebalancing?

Thanks,
Eran

Do you have the logs from those two servers?

On Sun, Nov 6, 2011 at 6:18 PM, Eran Kutner eran@gigya-inc.com wrote:

The relevant parts of the logs from es1-02 are here:
http://pastebin.com/aZshxLqn
and the parts from es1-01 are here: http://pastebin.com/et1MWMDG

Note that the reset we initiated was on Nov. 2nd at 4:24am. I can't be sure,
but we don't recall resetting the service on Nov. 1st at 00:21, when the log
of es1-01 indicates a reset. Also, the logs don't show a "stopping" message
before those lines. Does ES have some built-in watchdog that could do this?

Let me know if there is any additional information I can provide.

Thanks.

-eran