Unicast not working after upgrade to 0.20.2

Hi All

I upgraded my cluster from 0.19.8 to 0.20.2 and was getting split-brain
confusion. Here is what I did:
In config/elasticsearch.yml I have the following (this is the same before
and after the upgrade):
path.data: /mnt/sda2/data/
path.logs: /mnt/sda2/logs/elasticsearch
node.master: true
node.data: true
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["192.168.1.9", "192.168.1.8", "129.168.1.7", "192.168.1.6"]

So I stopped all nodes (running 0.19.8), upgraded to 0.20.2, then started
all nodes (I assume this is what is referred to as 'a full cluster
restart'?). When I noticed high CPU usage for 30 minutes I investigated
using BigDesk and noticed the nodes were in separate clusters.

I did not find any reference in the changelogs to unicast or multicast, so
I find this behaviour strange. I had to revert to 0.19.8 to bring things
back to sanity.

Anyone have any insight into this behaviour?

By the way, the reason I'm using unicast is that the network does not support multicast.

GX

--

OK, so unicast and multicast are too scary for anyone to touch, but can
someone please at least clarify what is meant by "a full cluster restart"?
Am I correct in assuming that all nodes need to be shut down, upgraded, and
then started, or can I upgrade the nodes one by one while the others are
still running to minimize downtime?

GX

IMHO a full cluster restart is a global shutdown, an upgrade of all the nodes, and a restart. So, yes, you will have downtime, but perhaps under a minute if you have prepared everything beforehand...
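
For example, one way to script the shutdown side of it (a minimal sketch, assuming a node answering on localhost:9200 and the cluster-wide _shutdown API of that era, which was removed in later versions):

import urllib.request

def post(url):
    # Issue an empty POST request and return the response body.
    req = urllib.request.Request(url, data=b"", method="POST")
    return urllib.request.urlopen(req).read()

# Flush indices first so recovery has less translog to replay after the restart.
post("http://localhost:9200/_flush")

# Ask every node in the cluster to shut down with a single call.
post("http://localhost:9200/_cluster/nodes/_shutdown")

Then upgrade the binaries on every machine and start the nodes again.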

HTH

--
David :wink:
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 25 Jan 2013, at 04:01, GX mailme.gx@gmail.com wrote:

OK, so unicast and multicast are too scary for anyone to touch, but can someone please at least clarify what is meant by "a full cluster restart"? Am I correct in assuming that all nodes need to be shut down, upgraded, and then started, or can I upgrade the nodes one by one while the others are still running to minimize downtime?

GX

--

Did you manage to resolve this? A full cluster restart means shutting down all the nodes and starting them again with the new version. Unicast discovery works the same as in 0.19.

On Jan 25, 2013, at 4:01 AM, GX mailme.gx@gmail.com wrote:

OK, so unicast and multicast are too scary for anyone to touch, but can someone please at least clarify what is meant by "a full cluster restart"? Am I correct in assuming that all nodes need to be shut down, upgraded, and then started, or can I upgrade the nodes one by one while the others are still running to minimize downtime?

GX

Hi kimchy

Yes, I managed to resolve this. I think the problem causing the split
brain may have been a wrong IP in the list; note the third element is 129,
not 192:

discovery.zen.ping.unicast.hosts: ["192.168.1.9", "192.168.1.8", "129.168.1.7", "192.168.1.6"]

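A quick reachability check against each configured host would have caught
this. Here is a minimal sketch (assuming the default transport port 9300):

import socket

hosts = ["192.168.1.9", "192.168.1.8", "129.168.1.7", "192.168.1.6"]

for host in hosts:
    # Try a plain TCP connection to the transport port each node listens on.
    try:
        socket.create_connection((host, 9300), timeout=3).close()
        print(host, "reachable")
    except OSError as err:
        print(host, "NOT reachable:", err)

The mistyped 129.168.1.7 would show up as unreachable.
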
After that I managed to get all nodes in one cluster. At some point one of
the nodes became unresponsive; after I tried to restart it, it kept failing
to start, and restarting other nodes also failed. I panicked, stopped all
nodes, switched back to 0.19.8 and tried to start, and again all nodes
failed... At this point people were asking some very serious questions...
Well, to cut a long story short, I looked into the logs and they complained
of a special character not allowed in the yml file. I found in one of the
comments the French character "à" (in the phrase "à la RAID"); I changed it
to a normal "a" and managed to start all nodes (on 0.20.2) fine. Status was
green in a couple of minutes.

I can't understand why the conf file suddenly caused the nodes to stop
working. I can only assume that when I fixed the IPs the file's encoding
was changed (probably to UTF-8) and ES or Java didn't like that.
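
If it helps anyone else, here is a quick way to hunt for stray non-ASCII
characters in the config (a minimal sketch; adjust the path to your layout):

# Report any non-ASCII characters in elasticsearch.yml, with line numbers.
with open("config/elasticsearch.yml", "rb") as f:
    for lineno, raw in enumerate(f, start=1):
        for ch in raw.decode("utf-8", errors="replace"):
            if ord(ch) > 127:
                print("line", lineno, "contains non-ASCII character:", repr(ch))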

All seems well now and things are running fine. I have been hounding the
network administrators for an answer as to whether multicast is enabled
and, if not, why it isn't possible, but I have not managed to get a
straight, coherent answer.

Regards

GX

On Sunday, January 27, 2013 12:34:02 PM UTC+2, kimchy wrote:

Did you manage to resolve this? A full cluster restart means shutting down
all the nodes and starting them again with the new version. Unicast
discovery works the same as in 0.19.

On Jan 25, 2013, at 4:01 AM, GX mailme.gx@gmail.com wrote:

OK, so unicast and multicast are too scary for anyone to touch, but can
someone please at least clarify what is meant by "a full cluster restart"?
Am I correct in assuming that all nodes need to be shut down, upgraded, and
then started, or can I upgrade the nodes one by one while the others are
still running to minimize downtime?

GX

--

Are you using Elasticsearch on Windows? Wondering if you had the same
issue: Replace à with a by dadoonet · Pull Request #2389 · elastic/elasticsearch · GitHub

The problem should have been "fixed" in 0.20.2, though.

--
Ivan

On Sun, Jan 27, 2013 at 2:51 AM, GX mailme.gx@gmail.com wrote:

Well, to cut a long story short, I looked into the logs and they complained
of a special character not allowed in the yml file. I found in one of the
comments the French character "à" (in the phrase "à la RAID"); I changed it
to a normal "a" and managed to start all nodes (on 0.20.2) fine. Status was
green in a couple of minutes.

I can't understand why the conf file suddenly caused the nodes to stop
working. I can only assume that when I fixed the IPs the file's encoding
was changed (probably to UTF-8) and ES or Java didn't like that.

Hi Ivan

No, I'm not using Windows; I'm using Slackware (both development and
production), but that is the exact error I had. I may have copied the
0.19.8 config yml file to 0.20.2 to keep my settings...

Thanks for the input

GX

On Sunday, January 27, 2013 6:36:33 PM UTC+2, Ivan Brusic wrote:

Are you using Elasticsearch on Windows? Wondering if you had the same
issue: Replace à with a by dadoonet · Pull Request #2389 · elastic/elasticsearch · GitHub

The problem should have been "fixed" in 0.20.2, though.

--
Ivan

On Sun, Jan 27, 2013 at 2:51 AM, GX mailme.gx@gmail.com wrote:

Well, to cut a long story short, I looked into the logs and they complained
of a special character not allowed in the yml file. I found in one of the
comments the French character "à" (in the phrase "à la RAID"); I changed it
to a normal "a" and managed to start all nodes (on 0.20.2) fine. Status was
green in a couple of minutes.

I can't understand why the conf file suddenly caused the nodes to stop
working. I can only assume that when I fixed the IPs the file's encoding
was changed (probably to UTF-8) and ES or Java didn't like that.

--