Rolling upgrade node reconnects intermittently

Hi,

I'm trying to do a rolling upgrade from Elasticsearch 6.6.0 to 6.7.0. Our cluster is deployed in Kubernetes. When I restart one of the data nodes with the 6.7.0 image, it gets stuck in a loop, joining and leaving the cluster every 10 seconds or so with messages like the following:

[2019-04-03T22:40:22,324][DEBUG][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] restarting fault detection against master [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [new cluster state received and we are monitoring the wrong master [null]]
[2019-04-03T22:40:22,324][INFO ][o.e.c.s.ClusterApplierService] [elastic-datanodes-9] detected_master {elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, reason: apply cluster state (from master [master {elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [6426]])
[2019-04-03T22:40:29,333][TRACE][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retry [1] out of [3]
org.elasticsearch.transport.RemoteTransportException: [elastic-masternodes-2][10.0.14.87:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-04-03T22:40:29,335][TRACE][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retry [2] out of [3]
org.elasticsearch.transport.RemoteTransportException: [elastic-masternodes-2][10.0.14.87:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-04-03T22:40:29,336][TRACE][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retry [3] out of [3]
org.elasticsearch.transport.RemoteTransportException: [elastic-masternodes-2][10.0.14.87:9300][internal:discovery/zen/fd/master_ping]
Caused by: java.lang.IllegalStateException
[2019-04-03T22:40:29,337][DEBUG][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] failed to ping [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], tried [3] times, each with maximum [30s] timeout
[2019-04-03T22:40:29,337][DEBUG][o.e.d.z.MasterFaultDetection] [elastic-datanodes-9] [master] stopping fault detection against master [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [master failure, failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-04-03T22:40:29,337][INFO ][o.e.d.z.ZenDiscovery     ] [elastic-datanodes-9] master_left [{elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-04-03T22:40:29,337][WARN ][o.e.d.z.ZenDiscovery     ] [elastic-datanodes-9] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
   {elastic-masternodes-1}{KBag42INRDmH7ktizQ1KQg}{ejiOIB5qRr6AkvirtrDv2w}{10.0.11.241}{10.0.11.241:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
   {elastic-masternodes-2}{WAwlh6G4SN2axgMoyg7IyQ}{dbAj0ALKTr-CJCZp3unloQ}{10.0.14.87}{10.0.14.87:9300}{ml.machine_memory=67587043328, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, master
   ...

I don't think this is a network issue, because from within the pod I can ping the master node it says it failed to ping and connect to its port 9300. Also, when I revert the image back to 6.6.0, the messages stop and the cluster recovers.
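For reference, the connectivity check I'm doing from inside the pod is just a plain TCP connect to the master's transport address (IP and port taken from the log above), roughly like this:

    import socket

    # Plain TCP connect to the master's transport address from inside the data node's pod.
    # The address is the one from the log lines above; adjust as needed.
    with socket.create_connection(("10.0.14.87", 9300), timeout=5) as sock:
        print("connected to", sock.getpeername())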

I'm wondering what this message means and why it's happening, and also whether it is safe to re-enable shard allocation while this node is connecting and reconnecting, so I can continue with the rolling upgrade.
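(By re-enabling shard allocation I mean the usual cluster settings call from the rolling upgrade guide; a rough sketch of what I'd run, with the host being illustrative:)

    import requests

    # Re-enable shard allocation once the upgraded node has rejoined the cluster.
    # Setting null resets cluster.routing.allocation.enable to its default ("all").
    resp = requests.put(
        "http://localhost:9200/_cluster/settings",
        json={"persistent": {"cluster.routing.allocation.enable": None}},
    )
    resp.raise_for_status()
    print(resp.json())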

Also, the rolling upgrade guide isn't clear on whether I should upgrade the master nodes or the data nodes first (or some combination of both). Any tips here would be greatly appreciated.

Thanks

There is a known issue with Elasticsearch 6.7.0 in Docker that I think would explain this. It will be fixed in 6.7.1, so it's probably simplest to just wait for that release.
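Once 6.7.1 is out and you've rolled it onto a node, you can confirm which version each node is actually running via the nodes info API; a quick sketch (host is illustrative):

    import requests

    # Print each node's name and Elasticsearch version to confirm the upgrade took effect.
    nodes = requests.get("http://localhost:9200/_nodes").json()["nodes"]
    for node_id, info in nodes.items():
        print(info["name"], info["version"])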
