Upgraded node to 6.7.0 => stuck in master_left/detected_master loop

We run a four-node cluster, currently at 6.6.1. I just started a rolling upgrade to 6.7.0, but the first node I upgraded (node-004) failed to rejoin the cluster. It appears that:

  1. node-004 detects master node-003
  2. node-004 is started
  3. node-004 is initialized
  4. node-004 disconnects from cluster with reason "failed to ping"
  5. node-004 detects master node-003
  6. goto 4

The node gets stuck cycling through detected_master -> master_left -> detected_master -> ...

I then rolled back node-004 to 6.6.1 and it joined the cluster correctly again. I also tried 6.6.2, which joined correctly as well. I tried 6.7.0 a couple more times and hit the same problem each time.

Any ideas on how to troubleshoot this?

Excerpt from the logs:

[2019-04-04T13:48:14,973][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [node-004] [controller/94] [Main.cc@109] controller (64 bit): Version 6.7.0 (Build d74ae2ac01b10d) Copyright (c) 2019 Elasticsearch BV
[2019-04-04T13:48:23,126][INFO ][o.e.d.DiscoveryModule    ] [node-004] using discovery type [zen] and host providers [settings, file]
[2019-04-04T13:48:23,707][INFO ][o.e.n.Node               ] [node-004] initialized
[2019-04-04T13:48:23,707][INFO ][o.e.n.Node               ] [node-004] starting ...
[2019-04-04T13:48:23,801][INFO ][o.e.t.TransportService   ] [node-004] publish_address {10.33.9.93:9300}, bound_addresses {0.0.0.0:9300}
[2019-04-04T13:48:24,497][INFO ][o.e.b.BootstrapChecks    ] [node-004] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2019-04-04T13:48:24,506][INFO ][c.f.s.c.IndexBaseConfigurationRepository] [node-004] Check if searchguard index exists ...
[2019-04-04T13:48:28,365][INFO ][o.e.c.s.ClusterApplierService] [node-004] detected_master {node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, added {{node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node-002}{tLO4nmUvRjOySDugDz1EYA}{709EMTswQJawm1naD1XmUA}{10.33.9.91}{10.33.9.91:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{node-001}{ioxGy9M6S9i4gpkKjDrslQ}{uBZMhAFrQrqw2ZoerK0rHA}{10.33.14.74}{10.33.14.74:9300}{ml.machine_memory=68179443712, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [97250]])
[2019-04-04T13:48:28,478][INFO ][o.e.c.s.ClusterSettings  ] [node-004] updating [cluster.routing.allocation.enable] from [all] to [none]
[2019-04-04T13:48:30,286][INFO ][o.e.l.LicenseService     ] [node-004] license [28ba49bb-033c-4383-b333-f2f77e80c96f] mode [basic] - valid
[2019-04-04T13:48:30,310][INFO ][o.e.h.n.Netty4HttpServerTransport] [node-004] publish_address {10.33.9.93:9200}, bound_addresses {0.0.0.0:9200}
[2019-04-04T13:48:30,311][INFO ][o.e.n.Node               ] [node-004] started
[2019-04-04T13:48:30,311][INFO ][c.f.s.SearchGuardPlugin  ] [node-004] 0 Search Guard modules loaded so far: []
[2019-04-04T13:48:30,431][INFO ][c.f.s.c.IndexBaseConfigurationRepository] [node-004] Search Guard License Info: No license needed because enterprise modules are not enabled
[2019-04-04T13:48:30,432][INFO ][c.f.s.c.IndexBaseConfigurationRepository] [node-004] Node 'node-004' initialized
[2019-04-04T13:48:32,526][INFO ][o.e.d.z.ZenDiscovery     ] [node-004] master_left [{node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2019-04-04T13:48:32,527][WARN ][o.e.d.z.ZenDiscovery     ] [node-004] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: nodes:
   {node-004}{dw9erMZ6ScKfdpJhQCMRnA}{j0hEhhWXREy7SlWlUBJijQ}{10.33.9.93}{10.33.9.93:9300}{ml.machine_memory=68311592960, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, local
   {node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, master
   {node-002}{tLO4nmUvRjOySDugDz1EYA}{709EMTswQJawm1naD1XmUA}{10.33.9.91}{10.33.9.91:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
   {node-001}{ioxGy9M6S9i4gpkKjDrslQ}{uBZMhAFrQrqw2ZoerK0rHA}{10.33.14.74}{10.33.14.74:9300}{ml.machine_memory=68179443712, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
[2019-04-04T13:48:35,941][INFO ][o.e.c.s.ClusterApplierService] [node-004] detected_master {node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, reason: apply cluster state (from master [master {node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [97256]])
[2019-04-04T13:48:42,970][INFO ][o.e.d.z.ZenDiscovery     ] [node-004] master_left [{node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with  maximum [30s] timeout]
[2019-04-04T13:48:42,970][WARN ][o.e.d.z.ZenDiscovery     ] [node-004] master left (reason = failed to ping, tried [3] times, each with  maximum [30s] timeout), current nodes: nodes:
    {node-004}{dw9erMZ6ScKfdpJhQCMRnA}{j0hEhhWXREy7SlWlUBJijQ}{10.33.9.93}{10.33.9.93:9300}{ml.machine_memory=68311592960, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, local
    {node-003}{fzlHKosuRseXxSPur04-Sg}{RR9kG7crRk-1mwwBDB4KAg}{10.33.9.92}{10.33.9.92:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, master
    {node-002}{tLO4nmUvRjOySDugDz1EYA}{709EMTswQJawm1naD1XmUA}{10.33.9.91}{10.33.9.91:9300}{ml.machine_memory=68311584768, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
    {node-001}{ioxGy9M6S9i4gpkKjDrslQ}{uBZMhAFrQrqw2ZoerK0rHA}{10.33.14.74}{10.33.14.74:9300}{ml.machine_memory=68179443712, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}

Are you running this node in Docker? There is a known issue with Docker deployments in 6.7.0 which will be fixed in 6.7.1.

If not, can you check the other nodes' logs for exceptions? It might also help to set logger.org.elasticsearch.action: DEBUG to see some more detail.
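
You can also apply that logger setting at runtime through the cluster settings API instead of editing elasticsearch.yml and restarting. A minimal sketch using Python's requests library; the host, port, and absence of auth/TLS are assumptions, so adjust for your Search Guard setup:

```python
# Hedged sketch: raise the action logger to DEBUG via a transient cluster
# setting, then reset it once the extra detail has been captured.
# Assumes plain HTTP on localhost:9200 with no authentication.
import requests

ES = "http://localhost:9200"

# Enable DEBUG logging for org.elasticsearch.action cluster-wide.
resp = requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"logger.org.elasticsearch.action": "DEBUG"}},
)
resp.raise_for_status()
print(resp.json())

# Reset the logger to its default level (null clears a transient setting).
requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"logger.org.elasticsearch.action": None}},
).raise_for_status()
```

Because the setting is transient, it is also dropped automatically on a full cluster restart, so there is no risk of leaving DEBUG logging on permanently.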

Thanks!

Yes, my nodes run as Docker containers. Will wait for 6.7.1 then.

6.7.1 is released.

