Split brain?

I've had a cluster of two 2GB Rackspace cloud boxes running ES 0.18.5 for
the last 3 weeks. It's been working great, and we've inserted and deleted
several million records in that time.

Here is the config:

http://d.pr/WbU7

Yesterday I updated the boxes to 0.18.6 (and was trying to get the
boxes to log to syslog as well) - it appears that something didn't
work so well during the upgrade and I was left with 2 boxes both
thinking that they're masters.

Here are the logs from the boxes during the upgrade:

http://d.pr/HHbC
http://d.pr/n3jk

And then from the next day:

http://d.pr/QEkU
http://d.pr/I7fo

I tried to get them to reconnect, but I couldn't get anything to work -
they stayed completely separate.

I now have a single box with the correct index:

http://d.pr/4TTq

I have a Chef recipe that builds new Elasticsearch boxes, so I spun up
a new box and tried to get it to join the cluster - no dice; it's like
none of the other boxes exist. I've also tried going back down to
0.18.5 - no dice.

Is there a way I can point a new box at that master directly and say:
"Hey, you're a slave - there's the master"?

I'm at a bit of a loss here - I don't want to admit defeat, but I'm stuck.

It's a system that we're building and I CAN lose the data, but I
really want to understand:

  1. Why this happened.
  2. How I can recover from this.
  3. How I can prevent this from happening in the future.

Thanks in advance - if anybody can point me at something, I'll greatly
appreciate it.

Heya:

First, regarding the split brain: it can certainly happen, especially with
2 servers - if the network between the two gets disconnected, for example.
In that case you will end up with two separate one-node clusters, and you
will need to resolve it yourself by restarting one of them. You should see
in the logs that one node got disconnected from the other. If you had a
larger cluster, you could define the "minimum_master_nodes" parameter to
reduce the chances of it happening.
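
With more nodes, the value should be a quorum of the master-eligible
nodes, i.e. (N / 2) + 1, so 2 for a 3-node cluster. A minimal sketch of
the setting in elasticsearch.yml:

  # elasticsearch.yml - sketch, assuming 3 master-eligible nodes
  discovery.zen.minimum_master_nodes: 2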

I am not sure why the node isn't finding the other node when you restart
it. I am assuming you are using unicast discovery - are you sure it's
configured properly? You can set discovery: TRACE in the logging.yml file
to see which nodes it tries to ping and what the status of each ping is.
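
In logging.yml that is just the discovery logger entry, roughly:

  # logging.yml - raise discovery logging to TRACE (sketch)
  logger:
    discovery: TRACE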

-shay.banon

On Fri, Dec 30, 2011 at 9:39 PM, Darron Froese darron@nonfiction.ca wrote:

I was using multicast discovery and it was working great before -
here's the log with extra debugging:

http://d.pr/9pWC

It looks like it didn't find anything at all.

So I set up unicast and added the master to the config:

http://d.pr/NxBE
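
Roughly, that unicast setup amounts to the following in elasticsearch.yml
(a sketch - the address is a placeholder for the surviving master):

  # disable multicast and ping the known node directly
  discovery.zen.ping.multicast.enabled: false
  discovery.zen.ping.unicast.hosts: ["10.x.x.x:9300"]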

And it seemed to work:

http://d.pr/jxjW
http://d.pr/Ubzi

I can switch to using unicast - that's no problem - I'm just not sure
why this happened. Maybe Rackspace made some network changes; not sure
why it worked for almost a month and then suddenly stopped.

A couple of questions:

  1. Should I put the IPs of all of the nodes in there?
  2. If I go up to a 3-node cluster, should I set "minimum_master_nodes" to 2?

Thanks for the tips Shay - really appreciate it.

On Fri, Dec 30, 2011 at 1:40 PM, Shay Banon kimchy@gmail.com wrote:

Strange that multicast worked... as far as I know it's not supported on
Rackspace. Yes, you should put all the IPs in the unicast list - that's
recommended if possible. And if you go up to 3 nodes, then yes, I think it
makes sense to have minimum_master_nodes set to 2.
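
Put together, each of the three nodes would carry roughly this in
elasticsearch.yml (a sketch - the addresses are placeholders):

  # list every node in the unicast hosts, and require a quorum of 2
  discovery.zen.ping.multicast.enabled: false
  discovery.zen.ping.unicast.hosts: ["10.x.x.1:9300", "10.x.x.2:9300", "10.x.x.3:9300"]
  discovery.zen.minimum_master_nodes: 2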

On Fri, Dec 30, 2011 at 2:42 PM, Darron Froese darron@nonfiction.ca wrote:

Yeah - it was working great. All the configs are in a git repo and pushed
out via Chef - it ran for a little over 3 weeks in production, and a
couple of weeks before that in testing.

I have a ticket in with Rackspace to see if they have changed something,
but I'll just switch to unicast now.

Thanks for your help - will be updating my configs now.

On Wed, Jan 4, 2012 at 10:26 AM, Darron Froese darron@nonfiction.ca wrote:

FYI - heard from Rackspace:

"We did recently implement multicast filtering on our new XenServer
Linux deployments. This was originally the intended design as having
multicast between all customers in the same huddles can be
problematic. I apologize that you used this as a feature before it was
blocked, but I feel it may ultimately be best with multicast
filtered.

Your timeline corresponds exactly with when I heard that the changes
were being rolled out."

Oh well - makes sense now.

On Wed, Jan 4, 2012 at 12:16 PM, Shay Banon kimchy@gmail.com wrote:

Interesting!


Very interesting! Thanks!

-stan

--
Stanislas Polu
Mo: +33 6 83 71 90 04 | Tw: @spolu | http://teleportd.com | Realtime Photo Search
