Unable to establish connection to master when upgrading to 6.7.0

Background: We had a 7-node cluster running 6.4.2 that we wanted to upgrade to 6.7.0. Adding new 6.7.0 nodes to this cluster did not work, while adding 6.5.4 and 6.6.2 nodes worked fine, so we decided to upgrade the cluster to 6.6.2 first.

So now we have a 6.6.2 cluster and we are unable to add 6.7.0 nodes to it. Adding more nodes of version 6.6.2 works fine.

Logs from a 6.7.0 node (elasticsearch4-1) trying to join the existing cluster:
[2019-04-01T18:06:14,585][INFO ][o.e.x.s.a.s.FileRolesStore] [elasticsearch4-1] parsed [0] roles from file [/usr/share/elasticsearch/config/roles.yml]
[2019-04-01T18:06:15,362][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [elasticsearch4-1] [controller/71] [Main.cc@109] controller (64 bit): Version 6.7.0 (Build d74ae2ac01b10d) Copyright (c) 2019 Elasticsearch BV
[2019-04-01T18:06:16,372][INFO ][o.e.d.DiscoveryModule ] [elasticsearch4-1] using discovery type [zen] and host providers [settings]
[2019-04-01T18:06:17,386][INFO ][o.e.n.Node ] [elasticsearch4-1] initialized
[2019-04-01T18:06:17,386][INFO ][o.e.n.Node ] [elasticsearch4-1] starting ...
[2019-04-01T18:06:17,546][INFO ][o.e.t.TransportService ] [elasticsearch4-1] publish_address {10.244.14.5:9300}, bound_addresses {0.0.0.0:9300}
[2019-04-01T18:06:17,562][INFO ][o.e.b.BootstrapChecks ] [elasticsearch4-1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2019-04-01T18:06:47,613][WARN ][o.e.n.Node ] [elasticsearch4-1] timed out while waiting for initial discovery state - timeout: 30s
[2019-04-01T18:06:47,626][INFO ][o.e.h.n.Netty4HttpServerTransport] [elasticsearch4-1] publish_address {10.244.14.5:9200}, bound_addresses {0.0.0.0:9200}
[2019-04-01T18:06:47,627][INFO ][o.e.n.Node ] [elasticsearch4-1] started
[2019-04-01T18:07:18,040][INFO ][o.e.c.s.ClusterApplierService] [elasticsearch4-1] detected_master {elasticsearch3-6}{CaAA7Em8ShqkRKhnHDURuw}{9ML67vr4RgCRFUQN9DUMug}{10.244.0.19}{10.244.0.19:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, added {{elasticsearch3-3}{HZAa-5edRU2W9M5vqO0n5Q}{WhzpFvNbSsmqCffJAqbYhw}{10.244.8.25}{10.244.8.25:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{elasticsearch3-1}{K9zC0fdZQAGGs4gPgGc1pw}{fEyZlApVSVKTkyv0aci98Q}{10.244.11.30}{10.244.11.30:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{elasticsearch3-5}{gV8YEzhHTYm-GlOSHuwp4Q}{ETzm1qu7Qgiae377DY9eLA}{10.244.9.24}{10.244.9.24:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{elasticsearch3-2}{5vTRWLrvTSW6Dptzy4hr4g}{LQa4EDL3QbmNMHPRsFG-8w}{10.244.10.24}{10.244.10.24:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{elasticsearch3-6}{CaAA7Em8ShqkRKhnHDURuw}{9ML67vr4RgCRFUQN9DUMug}{10.244.0.19}{10.244.0.19:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{elasticsearch3-0}{eHjpveEeTqiOTd7KwqppuA}{NwY-XamHTDa1eq2vQegAzQ}{10.244.5.25}{10.244.5.25:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{elasticsearch3-4}{yn6WcsxaSQOQfbIu_98MYg}{KVQtdkmDSEiC0SuZOyOeLA}{10.244.1.35}{10.244.1.35:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {elasticsearch3-6}{CaAA7Em8ShqkRKhnHDURuw}{9ML67vr4RgCRFUQN9DUMug}{10.244.0.19}{10.244.0.19:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [79304]])
[2019-04-01T18:07:23,803][WARN ][o.e.x.s.a.s.m.NativeRoleMappingStore] [elasticsearch4-1] Failed to clear cache for realms []
[2019-04-01T18:07:23,805][INFO ][o.e.x.s.a.TokenService ] [elasticsearch4-1] refresh keys
[2019-04-01T18:07:23,993][INFO ][o.e.x.s.a.TokenService ] [elasticsearch4-1] refreshed keys
[2019-04-01T18:07:24,521][INFO ][o.e.l.LicenseService ] [elasticsearch4-1] license [62fcc1be-002c-4c43-8c21-912ca5be6986] mode [basic] - valid
[2019-04-01T18:07:34,639][INFO ][o.e.d.z.ZenDiscovery ] [elasticsearch4-1] master_left [{elasticsearch3-6}{CaAA7Em8ShqkRKhnHDURuw}{9ML67vr4RgCRFUQN9DUMug}{10.244.0.19}{10.244.0.19:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]
[2019-04-01T18:07:34,640][WARN ][o.e.d.z.ZenDiscovery ] [elasticsearch4-1] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes:
{elasticsearch3-3}{HZAa-5edRU2W9M5vqO0n5Q}{WhzpFvNbSsmqCffJAqbYhw}{10.244.8.25}{10.244.8.25:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
{elasticsearch4-1}{paj8Zq0XQPyJ7E1asBNKhw}{9vi3p_E1QPqDNIpLoy_M6A}{10.244.14.5}{10.244.14.5:9300}{ml.machine_memory=16820711424, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, local
{elasticsearch3-2}{5vTRWLrvTSW6Dptzy4hr4g}{LQa4EDL3QbmNMHPRsFG-8w}{10.244.10.24}{10.244.10.24:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
{elasticsearch3-0}{eHjpveEeTqiOTd7KwqppuA}{NwY-XamHTDa1eq2vQegAzQ}{10.244.5.25}{10.244.5.25:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}
{elasticsearch3-4}{yn6WcsxaSQOQfbIu_98MYg}{KVQtdkmDSEiC0SuZOyOeLA}{10.244.1.35}{10.244.1.35:9300}{ml.machine_memory=16820711424, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}

This cycle of detecting the master and then losing it keeps repeating.

The master node reports the following errors:
[2019-04-01T18:24:14,390][WARN ][o.e.t.TcpTransport ] [elasticsearch3-6] exception caught on transport layer [Netty4TcpChannel{localAddress=/10.244.0.19:35870, remoteAddress=10.244.13.5/10.244.13.5:9300}], closing connection
java.lang.IllegalStateException: Message not fully read (response) for requestId [5725790], handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler/org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1@16dbe31e], error [false]; resetting
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1137) ~[elasticsearch-6.6.2.jar:6.6.2]
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:914) [elasticsearch-6.6.2.jar:6.6.2]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53) [transport-netty4-client-6.6.2.jar:6.6.2]

Hi @iremmats, thanks for the report. Are you running Elasticsearch using the official Docker images? If so, I think the issue is https://github.com/elastic/elasticsearch/issues/40511. If you set

logger.org.elasticsearch.action: DEBUG

then we should get a stack trace on the 6.6.2 node to confirm this.
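
If editing elasticsearch.yml and restarting is inconvenient, the same logger can, as far as I know, also be raised temporarily through the cluster settings API (and reset to null again once you have the stack trace), for example:

    PUT _cluster/settings
    {
      "transient": {
        "logger.org.elasticsearch.action": "DEBUG"
      }
    }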

Yes, we run the official containers.

We found some more information on this:

"caused_by" : {
"type" : "transport_serialization_exception",
"reason" : "Failed to deserialize response from handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler]",
"caused_by" : {
"type" : "illegal_state_exception",
"reason" : "unexpected distribution type [docker]; your distribution is broken"
}

Looking at the source code at https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/Build.java tells me that the docker distribution type was added about 28 days ago. So apparently the 6.6.2 Docker image identifies itself with distribution type "tar" and does not recognize the distribution type "docker" that the 6.7.0 containers report, so it throws an error when it deserializes their responses.
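
Just to illustrate the failure mode (this is only a sketch, not the actual Build.java code, and fromDisplayName here is a hypothetical stand-in for the real parsing logic): the 6.6.2 node maps the distribution-type string it reads off the wire onto the constants it was compiled with, and the new "docker" value falls through to the "your distribution is broken" exception.

    // Sketch of the failure mode only -- not the actual org.elasticsearch.Build code.
    // fromDisplayName is a hypothetical stand-in for the real parsing logic.
    public class DistributionTypeSketch {

        // The distribution types a 6.6.2 node knows about; there is no DOCKER yet.
        enum Type { DEB, RPM, TAR, ZIP, UNKNOWN }

        static Type fromDisplayName(String displayName) {
            switch (displayName) {
                case "deb":     return Type.DEB;
                case "rpm":     return Type.RPM;
                case "tar":     return Type.TAR;
                case "zip":     return Type.ZIP;
                case "unknown": return Type.UNKNOWN;
                default:
                    // This is the message that surfaces as transport_serialization_exception.
                    throw new IllegalStateException(
                            "unexpected distribution type [" + displayName + "]; your distribution is broken");
            }
        }

        public static void main(String[] args) {
            fromDisplayName("docker"); // throws, like the 6.6.2 nodes do for 6.7.0 responses
        }
    }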

I have read the GitHub issue now. It's spot on. 🙂
