We run a four-node Elasticsearch cluster, currently at 6.6.1. I just started a rolling upgrade to 6.7.0, but the first node I upgraded (node-004) failed to join the cluster. The sequence appears to be:
1. node-004 detects master node-003
2. node-004 is started
3. node-004 is initialized
4. node-004 disconnects from the cluster with reason "failed to ping"
5. node-004 detects master node-003 again
6. go to step 4
The node gets stuck in this loop: detected_master -> master_left -> detected_master -> ...
I then rolled node-004 back to 6.6.1 and it joined the cluster correctly again. I also tried 6.6.2, and that joined the cluster correctly too. I tried 6.7.0 a couple more times and hit the same problem each time.
Any ideas on how to troubleshoot this?
Excerpt from the logs:
[2019-04-04T13:48:14,973][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-004] [controller/94] [Main.cc@109] controller (64 bit): Version 6.7.0 (Build d74ae2ac01b10d) Copyright (c) 2019 Elasticsearch BV
[2019-04-04T13:48:23,126][INFO ][o.e.d.DiscoveryModule ] [node-004] using discovery type [zen] and host providers [settings, file]
[2019-04-04T13:48:23,707][INFO ][o.e.n.Node ] [node-004] initialized
[2019-04-04T13:48:23,707][INFO ][o.e.n.Node ] [node-004] starting ...
[2019-04-04T13:48:23,801][INFO ][o.e.t.TransportService ] [node-004] publish_address {10.33.9.93:9300}, bound_addresses {0.0.0.0:9300}
[2019-04-04T13:48:24,497][INFO ][o.e.b.BootstrapChecks ] [node-004] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2019-04-04T13:48:24,506][INFO ][c.f.s.c.IndexBaseConfigurationRepository] [node-004] Check if searchguard index exists ...
[2019-04-04T13:48:28,365][INFO ][o.e.c.s.ClusterApplierService] [node-004] detected_master {node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, added {{node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node-002}{tLO4nmUvRjOySDugDz1EYA}{709EMTswQJawm1naD1XmUA}{10.33.9.91}{10.33.9.91:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node-001}{ioxGy9M6S9i4gpkKjDrslQ}{uBZMhAFrQrqw2ZoerK0rHA}{10.33.14.74}{10.33.14.74:9300}{ml.machine_memory=68179443712, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [97250]])
[2019-04-04T13:48:28,478][INFO ][o.e.c.s.ClusterSettings ] [node-004] updating [cluster.routing.allocation.enable] from [all] to [none]
[2019-04-04T13:48:30,286][INFO ][o.e.l.LicenseService ] [node-004] license [28ba49bb-033c-4383-b333-f2f77e80c96f] mode [basic] - valid
[2019-04-04T13:48:30,310][INFO ][o.e.h.n.Netty4HttpServerTransport] [node-004] publish_address {10.33.9.93:9200}, bound_addresses {0.0.0.0:9200}
[2019-04-04T13:48:30,311][INFO ][o.e.n.Node ] [node-004] started
[2019-04-04T13:48:30,311][INFO ][c.f.s.SearchGuardPlugin ] [node-004] 0 Search Guard modules loaded so far: []
[2019-04-04T13:48:30,431][INFO ][c.f.s.c.IndexBaseConfigurationRepository] [node-004] Search Guard License Info: No license needed because enterprise modules are not enabled
[2019-04-04T13:48:30,432][INFO ][c.f.s.c.IndexBaseConfigurationRepository] [node-004] Node 'node-004' initialized
[2019-04-04T13:48:32,526][INFO ][o.e.d.z.ZenDiscovery ] [node-004] master_left [{node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-04-04T13:48:32,527][WARN ][o.e.d.z.ZenDiscovery ] [node-004] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{node-004}{dw9erMZ6ScKfdpJhQCMRnA}{j0hEhhWXREy7SlWlUBJijQ}{10.33.9.93}{10.33.9.93:9300}{ml.machine_memory=68311592960, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, local
{node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, master
{node-002}{tLO4nmUvRjOySDugDz1EYA}{709EMTswQJawm1naD1XmUA}{10.33.9.91}{10.33.9.91:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
{node-001}{ioxGy9M6S9i4gpkKjDrslQ}{uBZMhAFrQrqw2ZoerK0rHA}{10.33.14.74}{10.33.14.74:9300}{ml.machine_memory=68179443712, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
[2019-04-04T13:48:35,941][INFO ][o.e.c.s.ClusterApplierService] [node-004] detected_master {node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, reason: apply cluster state (from master [master {node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [97256]])
[2019-04-04T13:48:42,970][INFO ][o.e.d.z.ZenDiscovery ] [node-004] master_left [{node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-04-04T13:48:42,970][WARN ][o.e.d.z.ZenDiscovery ] [node-004] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{node-004}{dw9erMZ6ScKfdpJhQCMRnA}{j0hEhhWXREy7SlWlUBJijQ}{10.33.9.93}{10.33.9.93:9300}{ml.machine_memory=68311592960, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, local
{node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, master
{node-002}{tLO4nmUvRjOySDugDz1EYA}{709EMTswQJawm1naD1XmUA}{10.33.9.91}{10.33.9.91:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
{node-001}{ioxGy9M6S9i4gpkKjDrslQ}{uBZMhAFrQrqw2ZoerK0rHA}{10.33.14.74}{10.33.14.74:9300}{ml.machine_memory=68179443712, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
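
In case it helps anyone reproduce the analysis, here is a minimal sketch for pulling the detected_master / master_left transitions out of a log like the one above and measuring the time between them. The function names and the regular expression are my own assumptions about the line layout shown in this excerpt, not anything provided by Elasticsearch, and the sample lines are abbreviated copies of the lines above. For this excerpt the gaps come out at roughly 4 and 7 seconds, which is much shorter than the "tried [3] times, each with maximum [30s] timeout" in the reason text would suggest.

```python
import re
from datetime import datetime

# Assumed layout: "[timestamp][LEVEL][logger] [node] event ...", as in the excerpt.
# Only the INFO transition lines are matched; the WARN "master left" line is not.
EVENT_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[\w+\s*\]\[[^\]]+\]\s+\[[^\]]+\]\s+"
    r"(?P<event>detected_master|master_left)\b"
)


def master_transitions(lines):
    """Yield (timestamp, event) for each detected_master / master_left line."""
    for line in lines:
        m = EVENT_RE.match(line)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S,%f")
            yield ts, m["event"]


def seconds_until_master_left(lines):
    """Return the seconds between each detected_master and the next master_left."""
    gaps = []
    detected_at = None
    for ts, event in master_transitions(lines):
        if event == "detected_master":
            detected_at = ts
        elif detected_at is not None:
            gaps.append((ts - detected_at).total_seconds())
            detected_at = None
    return gaps


# Abbreviated copies of four lines from the excerpt (node details trimmed).
sample = [
    "[2019-04-04T13:48:28,365][INFO ][o.e.c.s.ClusterApplierService] [node-004] detected_master {node-003}",
    "[2019-04-04T13:48:32,526][INFO ][o.e.d.z.ZenDiscovery ] [node-004] master_left [{node-003}]",
    "[2019-04-04T13:48:35,941][INFO ][o.e.c.s.ClusterApplierService] [node-004] detected_master {node-003}",
    "[2019-04-04T13:48:42,970][INFO ][o.e.d.z.ZenDiscovery ] [node-004] master_left [{node-003}]",
]

if __name__ == "__main__":
    for gap in seconds_until_master_left(sample):
        print(f"master_left {gap:.3f}s after detected_master")
```

To run it against a real log file, pass an open file object instead of `sample`, since the functions only need an iterable of lines.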