Hi,
I'm trying to do a rolling upgrade from Elasticsearch 6.6.0 to 6.7.0. Our cluster is deployed in Kubernetes. When I restart one of the data nodes to upgrade it to the 6.7.0 image, it gets stuck in a loop, detecting and then losing the master every 10 seconds or so, with messages like the following:
[2019-04-03T22:40:22,324][DEBUG][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] restarting fault detection against master [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [new cluster state received and we are monitoring the wrong master [null]]
[2019-04-03T22:40:22,324][INFO ][o.e.c.s.ClusterApplierService] [elastic-datanodes-9] detected_master {elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, reason: apply cluster state (from master [master {elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [6426]])
[2019-04-03T22:40:29,333][TRACE][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retry [1] out of [3]
org.elasticsearch.transport.RemoteTransportException: [elastic-masternodes-2][10.0.14.87:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-04-03T22:40:29,335][TRACE][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retry [2] out of [3]
org.elasticsearch.transport.RemoteTransportException: [elastic-masternodes-2][10.0.14.87:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-04-03T22:40:29,336][TRACE][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retry [3] out of [3]
org.elasticsearch.transport.RemoteTransportException: [elastic-masternodes-2][10.0.14.87:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-04-03T22:40:29,337][DEBUG][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], tried [3] times, each with maximum [30s] timeout
[2019-04-03T22:40:29,337][DEBUG][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] stopping fault detection against master [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [master failure, failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-04-03T22:40:29,337][INFO ][o.e.d.z.ZenDiscovery ] [elastic-datanodes-9] master_left [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-04-03T22:40:29,337][WARN ][o.e.d.z.ZenDiscovery ] [elastic-datanodes-9] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{elastic-masternodes-1}{KBag42INRDmH7ktizQ1KQg}{ejiOIB5qRr6AkvirtrDv2w}{10.0.11.241}{10.0.11.241:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, master
...
I don't think this is a network issue, because from within the pod I can ping the master node it says it failed to ping, and I can connect to its port 9300. Also, when I revert the image back to 6.6.0, the message goes away and the cluster recovers.
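For context, the connectivity check from inside the pod amounts to something like the following sketch (the master IP and transport port are the ones from the log above; the script itself is just illustrative, not part of our tooling):

    import socket

    # Plain TCP connect from inside the upgraded data-node pod to the
    # master's transport port, i.e. the address it "failed to ping".
    MASTER_HOST = "10.0.14.87"
    TRANSPORT_PORT = 9300

    with socket.create_connection((MASTER_HOST, TRANSPORT_PORT), timeout=5) as sock:
        print("connected to", sock.getpeername())

This connects fine every time, which is why I don't suspect basic networking.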
I'm wondering what this message means and why it happens, and also whether it is safe to re-enable shard allocation while this node is connecting/reconnecting, so that I can continue with the rolling upgrade.
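To be specific, by "re-enable shard allocation" I mean clearing the setting the rolling-upgrade guide has you disable before restarting a node, along these lines (Python with requests is just for illustration, the host is a placeholder for whichever node the client reaches, and I'm assuming allocation was disabled as a persistent setting as in the guide):

    import requests

    # Clear cluster.routing.allocation.enable so allocation returns to its
    # default; assumes it was disabled as a persistent setting beforehand.
    resp = requests.put(
        "http://localhost:9200/_cluster/settings",  # placeholder host
        json={"persistent": {"cluster.routing.allocation.enable": None}},  # null clears the setting
    )
    resp.raise_for_status()
    print(resp.json())

In practice we just hit the cluster settings API directly; the snippet is only to make clear which setting I mean.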
Also, the upgrade guide is not clear on whether I should upgrade the master node(s) or the data nodes first (or some combination of both). Any tips here would be greatly appreciated.
Thanks