This is working perfectly! I’m in the middle of a rolling restart of a test cluster with no issues:
elasticsearch- "number" : "1.4.0",
elasticsearch- "number" : "1.4.0",
elasticsearch- "number" : "1.3.4",
elasticsearch- "number" : "1.3.4",
elasticsearch- "number" : "1.3.4",
_cluster/health:
"cluster_name" : "elasticsearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 5,
"number_of_data_nodes" : 5,
"active_primary_shards" : 1730,
"active_shards" : 3460,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
I anticipate no other problems finishing this rolling upgrade. Thanks a ton everyone!
From: elasticsearch@googlegroups.com [mailto:elasticsearch@googlegroups.com] On Behalf Of David Pilato
Sent: Monday, November 24, 2014 8:31 AM
To: elasticsearch@googlegroups.com
Subject: Re: 1.4.0 data node can't join existing 1.3.4 cluster
Heya,
We will release AWS plugin 2.4.1 in a few minutes.
It fixes this rolling upgrade issue.
Note that some WARN messages could appear in the old nodes’ logs until the full rolling upgrade is done.
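On each node, the plugin upgrade itself might look roughly like this; a sketch assuming the 1.x bin/plugin script and a default layout (exact flags and paths can differ per install):
# replace the old cloud-aws plugin with the fixed release, then restart the node
bin/plugin --remove cloud-aws
bin/plugin --install elasticsearch/elasticsearch-cloud-aws/2.4.1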
Thank you all for reporting this!
On Sunday, November 23, 2014 at 03:10:42 UTC+1, Ivan Brusic wrote:
Great work everyone. Feel better about upgrading now.
On Nov 22, 2014 4:42 PM, "Boaz Leskes" <b.leskes@gmail.com> wrote:
Hi Christian, Daniel,
I believe I found the issue - it has to do with the cloud plugins (both AWS and GCE) and the way they create the node list for unicast-based discovery. Effectively they mislead it into thinking that all nodes in the cluster are version 1.4.0, which is not correct.
I opened issues for this so it will be corrected soon: https://github.com/elasticsearch/elasticsearch-cloud-aws/issues/143 , https://github.com/elasticsearch/elasticsearch-cloud-gce/issues/41
Cheers,
Boaz
On Saturday, November 22, 2014 7:04:33 PM UTC+2, Jörg Prante wrote:
As said, the change is due to the unicast action, which was split in 1.4.0 into an old and a new action; see this commit:
https://github.com/elasticsearch/elasticsearch/commit/e5de47d928582694c7729d199390086983779e6e
I am not sure if this is a bug. It seems like a feature to prevent multiple masters by accident.
The strategy as described above by Christian Hedegaard should work, though it is still to be considered a work-around:
- setting up all new 1.4 nodes as not master eligible ("data only")
- joining them to a 1.3.x cluster while the master is still on a 1.3 node should work
- then, shutting down all 1.3 nodes (except the master) should relocate the shards
- bringing down the final 1.3 master should "stall" master election (I would also configure a large timeout for master election). This is critical: no index/mapping creations/deletions or other cluster-state-modifying actions should be executed at this point.
- adding a 1.4 master-eligible node should now take over the cluster (I would start it with the data folder from the final 1.3 master, where the last cluster state is persisted), and the critical phase is over.
- from then on, more 1.4 master-eligible nodes can be added
- finally, the minimum_master_nodes setting should be configured (see the config sketch after this list)
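A minimal elasticsearch.yml sketch of the settings those steps refer to; the values, and the assumption of three master-eligible nodes for the quorum, are illustrative rather than taken from the thread:
# step 1: new 1.4 nodes join as data-only, not master eligible
node.master: false
node.data: true
# later: the first 1.4 node that is allowed to take over as master
node.master: true
# finally: with e.g. 3 master-eligible nodes, require a quorum of 2
discovery.zen.minimum_master_nodes: 2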
Jörg
On Fri, Nov 21, 2014 at 1:56 AM, Christian Hedegaard <chedegaard@red5studios.com> wrote:
FYI, I have found a solution that works (at least for me).
I’ve got a small cluster for testing with only 4 v1.3.5 nodes. What I’ve done is bring up 4 new v1.4.0 nodes as data-only machines. In the yaml I added a line to point the new nodes, via unicast, explicitly at the current master:
discovery.zen.ping.unicast.hosts: ["10.210.9.224:9300"]
When I restarted elasticsearch with that setting, with cloud-aws installed and configured on version 2.4.0, the new nodes found the cluster and properly joined it.
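Put together, the workaround amounts to an elasticsearch.yml fragment roughly like this on each new 1.4.0 node; the master/data flags follow the description above, and disabling multicast is an assumption rather than something quoted from the thread:
node.master: false                                        # data-only for now
node.data: true
discovery.zen.ping.multicast.enabled: false               # assumption: rely on unicast only
discovery.zen.ping.unicast.hosts: ["10.210.9.224:9300"]   # current 1.3.x master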
I will now start nuking the old v1.3.5 nodes to migrate the data off of them. Before the final 1.3.5 node is nuked, I will change the config on one of the v1.4.0 nodes to allow it to become master and restart it.
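A gentler alternative to simply killing each old node is to drain it first with allocation filtering and wait for relocation to finish; a sketch, where 10.210.9.225 is a hypothetical address standing in for the old node:
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient" : { "cluster.routing.allocation.exclude._ip" : "10.210.9.225" }
}'
# safe to stop the node once status is green and relocating_shards is 0
curl -s 'localhost:9200/_cluster/health?pretty'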
I’m not sure if the master step is needed or not, but I was very afraid of a split-brain problem. I have another 4-node testing cluster on which I will be able to try this upgrade again in a more controlled manner.
I’m NOT looking forward to upgrading our current production cluster this way (15 data-only nodes, 3 master-only nodes).
So it would appear that the problem is somewhere in the unicast discovery code. The question is: who’s to blame, Elasticsearch or the cloud-aws plugin?
From: Boaz Leskes [mailto:b.leskes@gmail.com]
Sent: Wednesday, November 19, 2014 2:27 PM
To: elasticsearch@googlegroups.com
Cc: Christian Hedegaard
Subject: Re: 1.4.0 data node can't join existing 1.3.4 cluster
Hi Christian,
I'm not sure which thread you are referring to exactly, but this shouldn't happen. Can you describe the problem you're having in some more detail? Anything in the logs of the nodes (both the 1.4 node and the master)?
Cheers,
Boaz
On Wednesday, November 19, 2014 2:39:57 AM UTC+1, Christian Hedegaard wrote:
I found this thread while trying to research the same issue, and it looks like there is currently no resolution. We like to keep up with our elasticsearch upgrades as often as possible and do rolling upgrades to keep our clusters up. When testing, I’m having the same issue: I cannot add a 1.4.0 box to the existing 1.3.4 cluster.
Is there a fix for this anticipated?