Hi,
I'm trying to do a rolling upgrade from Elasticsearch 6.6.0 to 6.7.0. Our cluster is deployed in Kubernetes. When I restart one of the data nodes to upgrade it to the 6.7.0 image, it gets stuck in a loop, detecting and then losing the master every 10 seconds or so, with messages like the following:
[2019-04-03T22:40:22,324][DEBUG][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] restarting fault detection against master [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [new cluster state received and we are monitoring the wrong master [null]]
[2019-04-03T22:40:22,324][INFO ][o.e.c.s.ClusterApplierService] [elastic-datanodes-9] detected_master {elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, reason: apply cluster state (from master [master {elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [6426]])
[2019-04-03T22:40:29,333][TRACE][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retry [1] out of [3]
org.elasticsearch.transport.RemoteTransportException: [elastic-masternodes-2][10.0.14.87:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-04-03T22:40:29,335][TRACE][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retry [2] out of [3]
org.elasticsearch.transport.RemoteTransportException: [elastic-masternodes-2][10.0.14.87:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-04-03T22:40:29,336][TRACE][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retry [3] out of [3]
org.elasticsearch.transport.RemoteTransportException: [elastic-masternodes-2][10.0.14.87:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-04-03T22:40:29,337][DEBUG][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], tried [3] times, each with maximum [30s] timeout
[2019-04-03T22:40:29,337][DEBUG][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] stopping fault detection against master [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [master failure, failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-04-03T22:40:29,337][INFO ][o.e.d.z.ZenDiscovery ] [elastic-datanodes-9] master_left [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-04-03T22:40:29,337][WARN ][o.e.d.z.ZenDiscovery ] [elastic-datanodes-9] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{elastic-masternodes-1}{KBag42INRDmH7ktizQ1KQg}{ejiOIB5qRr6AkvirtrDv2w}{10.0.11.241}{10.0.11.241:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, master
...
I don't think this is a network issue, because from within the pod I can ping the master node it says it failed to ping, and I can connect to its port 9300. Also, when I revert the image back to 6.6.0, the message goes away and the cluster recovers.
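For context, the connectivity check from inside the pod amounts to something like the following sketch (the master IP and transport port are the ones from the log above; the script itself is just illustrative, not part of our tooling):

    import socket

    # Plain TCP connect from inside the upgraded data-node pod to the
    # master's transport port, i.e. the address it "failed to ping".
    MASTER_HOST = "10.0.14.87"
    TRANSPORT_PORT = 9300

    with socket.create_connection((MASTER_HOST, TRANSPORT_PORT), timeout=5) as sock:
        print("connected to", sock.getpeername())

This connects fine every time, which is why I don't suspect basic networking.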
I'm wondering what this message means and why it happens, and also whether it is safe to re-enable shard allocation while this node is connecting/reconnecting, so that I can continue with the rolling upgrade.
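To be specific, by "re-enable shard allocation" I mean clearing the setting the rolling-upgrade guide has you disable before restarting a node, along these lines (Python with requests is just for illustration, the host is a placeholder for whichever node the client reaches, and I'm assuming allocation was disabled as a persistent setting as in the guide):

    import requests

    # Clear cluster.routing.allocation.enable so allocation returns to its
    # default; assumes it was disabled as a persistent setting beforehand.
    resp = requests.put(
        "http://localhost:9200/_cluster/settings",  # placeholder host
        json={"persistent": {"cluster.routing.allocation.enable": None}},  # null clears the setting
    )
    resp.raise_for_status()
    print(resp.json())

In practice we just hit the cluster settings API directly; the snippet is only to make clear which setting I mean.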
Also, the upgrade guide is not clear on whether I should upgrade the master node(s) or the data nodes first (or some combination of both). Any tips here would be greatly appreciated.
Thanks