Upgraded node not rejoining non-upgraded cluster after 6.5.1 to 6.7 upgrade - IOException[Invalid string; unexpected character: 255 hex: ff]

Upgraded 1 node of a 4-node cluster this morning from 6.5.1 to 6.7.
The upgraded node starts up fine & comes online, but is refusing to join the existing cluster since the upgrade.

The elasticsearch.yml config is:
cluster.name: CT-ES-PROD
discovery.zen.ping.unicast.hosts:
- wp005885.ourdomain.com:9200
- wp005878.ourdomain.com:9200
- wp005881.ourdomain.com:9200
- wp005882.ourdomain.com:9200
http.port: 9200
transport.tcp.port: 9300
network.host: wp005885.ourdomain.com

Nothing is standing out from the logs as to why the 6.5.1 nodes are not being discovered.

[2019-03-28T11:44:05,736][INFO ][o.e.d.DiscoveryModule ] [WP005885-ONYX-NODE1] using discovery type [zen] and host providers [settings]
[2019-03-28T11:44:06,533][INFO ][o.e.n.Node ] [WP005885-ONYX-NODE1] initialized
[2019-03-28T11:44:06,533][INFO ][o.e.n.Node ] [WP005885-ONYX-NODE1] starting ...
[2019-03-28T11:44:06,767][INFO ][o.e.t.TransportService ] [WP005885-ONYX-NODE1] publish_address {10.105.50.53:9300}, bound_addresses {[fe80::ddf:6102:4b6f:193]:9300}, {10.105.50.53:9300}
[2019-03-28T11:44:07,174][INFO ][o.e.b.BootstrapChecks ] [WP005885-ONYX-NODE1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2019-03-28T11:44:10,221][INFO ][o.e.c.s.MasterService ] [WP005885-ONYX-NODE1] zen-disco-elected-as-master ([0] nodes joined), reason: new_master {WP005885-ONYX-NODE1}{VzRIxScoSMWNqkPVwNl2AQ}{LEvYVWjEStW1ikUvYflPQw}{wp005885.bskyb.com}{10.105.50.53:9300}{ml.machine_memory=17179332608, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
[2019-03-28T11:44:10,221][INFO ][o.e.c.s.ClusterApplierService] [WP005885-ONYX-NODE1] new_master {WP005885-ONYX-NODE1}{VzRIxScoSMWNqkPVwNl2AQ}{LEvYVWjEStW1ikUvYflPQw}{wp005885.bskyb.com}{10.105.50.53:9300}{ml.machine_memory=17179332608, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, reason: apply cluster state (from master [master {WP005885-ONYX-NODE1}{VzRIxScoSMWNqkPVwNl2AQ}{LEvYVWjEStW1ikUvYflPQw}{wp005885.bskyb.com}{10.105.50.53:9300}{ml.machine_memory=17179332608, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)]])
[2019-03-28T11:44:10,283][INFO ][o.e.h.n.Netty4HttpServerTransport] [WP005885-ONYX-NODE1] publish_address {10.105.50.53:9200}, bound_addresses {[fe80::ddf:6102:4b6f:193]:9200}, {10.105.50.53:9200}
[2019-03-28T11:44:10,283][INFO ][o.e.n.Node ] [WP005885-ONYX-NODE1] started
[20

How many master-eligible nodes do you have in the cluster? (A node is master-eligible by default, unless node.master: false is set.) Also you do not seem to have set discovery.zen.minimum_master_nodes, which might mean you have caused a split-brain.
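
For context, a minimal sketch of the usual quorum setting (the example value assumes all 4 of your nodes are master-eligible):

# quorum = floor(master-eligible nodes / 2) + 1, e.g. floor(4 / 2) + 1 = 3 for 4 master-eligible nodes
discovery.zen.minimum_master_nodes: 3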

I have 4 master-eligible just now.
Looks like it might be the split-brain issue...

so set
discovery.zen.minimum_master_nodes: 3

performed a full cluster restart

The 3 un-upgraded nodes have come back online into their own cluster:
{
  "cluster_name": "CT-ES-PROD",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "active_primary_shards": 256,
  "active_shards": 256,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 256,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 50
}

The upgraded node is now not starting, as it can't see the minimum number of master nodes:

{
  "error": {
    "root_cause": [
      {
        "type": "master_not_discovered_exception",
        "reason": null
      }
    ],
    "type": "master_not_discovered_exception",
    "reason": null
  },
  "status": 503
}

& seeing
[2019-03-28T12:56:17,971][WARN ][o.e.d.z.ZenDiscovery ] [WP005885-ONYX-NODE1] not enough master nodes discovered during pinging (found [[Candidate{node={WP005885-ONYX-NODE1}{VzRIxScoSMWNqkPVwNl2AQ}{Io0sILzTSk-tWUeRw2K5CQ}{wp005885.bskyb.com}{10.105.50.53:9300}{ml.machine_memory=17179332608, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, clusterStateVersion=-1}]], but needed [3]), pinging again

in the logs

Ok, that sounds like progress.

This newly-upgraded node is now apparently unable to discover the other nodes. Can you check that:

  • discovery.zen.ping.unicast.hosts is set the same on this node as on the other nodes.
  • This node can resolve the DNS names to the right IP addresses.

I'm suspicious about the port 9200 in its config. That's the HTTP port by default, and coincides with http.port below, but you should be pointing discovery.zen.ping.unicast.hosts at transport ports which are normally 9300.
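
As a sketch (reusing the host names from your config above; adjust if your transport ports differ), the list would point at 9300 rather than 9200:

discovery.zen.ping.unicast.hosts:
- wp005885.ourdomain.com:9300   # transport port (transport.tcp.port), not the HTTP port 9200
- wp005878.ourdomain.com:9300
- wp005881.ourdomain.com:9300
- wp005882.ourdomain.com:9300
# omitting the port also works and falls back to the default transport port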

So it looks like the :9200 was added when I started up the upgraded node; I'd originally listed them without the port to use the default.

We also have this set further down
transport.tcp.port: 9300

Going to try changing to 9300 & see what happens.

IP resolution looks fine, as I can see traffic in & out on ports 9200 & 9300 from the affected node to an OK node via Wireshark.

So it looks like it can now see the elected master,
but it's either not liking the response or something is wrong with the join request...

[2019-03-28T13:17:58,990][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [WP005885-ONYX-NODE1] [controller/2240] [Main.cc@109] controller (64 bit): Version 6.7.0 (Build d74ae2ac01b10d) Copyright (c) 2019 Elasticsearch BV
[2019-03-28T13:18:00,927][INFO ][o.e.d.DiscoveryModule ] [WP005885-ONYX-NODE1] using discovery type [zen] and host providers [settings]
[2019-03-28T13:18:01,740][INFO ][o.e.n.Node ] [WP005885-ONYX-NODE1] initialized
[2019-03-28T13:18:01,740][INFO ][o.e.n.Node ] [WP005885-ONYX-NODE1] starting ...
[2019-03-28T13:18:01,990][INFO ][o.e.t.TransportService ] [WP005885-ONYX-NODE1] publish_address {10.105.50.53:9300}, bound_addresses {[fe80::206c:1b74:1ea:62f1]:9300}, {[fe80::ddf:6102:4b6f:193]:9300}, {10.105.50.53:9300}, {169.254.98.241:9300}
[2019-03-28T13:18:02,412][INFO ][o.e.b.BootstrapChecks ] [WP005885-ONYX-NODE1] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2019-03-28T13:18:05,552][INFO ][o.e.d.z.ZenDiscovery ] [WP005885-ONYX-NODE1] failed to send join request to master [{WP005881-SCOLO-NODE3}{9uRHBkOrQuu_5GVvvtLuUg}{RDt-KIgWTquNUVYrbT70-w}{WP005881.bskyb.com}{10.137.48.60:9300}{ml.machine_memory=17179332608, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[WP005881-SCOLO-NODE3][10.137.48.60:9300][internal:discovery/zen/join]]; nested: IllegalStateException[failure when sending a validation request to node]; nested: RemoteTransportException[[WP005885-ONYX-NODE1][10.105.50.53:9300][internal:discovery/zen/join/validate]]; nested: IOException[Invalid string; unexpected character: 255 hex: ff]; ]

[2019-03-28T13:18:29,818][INFO ][o.e.d.z.ZenDiscovery ] [WP005885-ONYX-NODE1] failed to send join request to master [{WP005881-SCOLO-NODE3}{9uRHBkOrQuu_5GVvvtLuUg}{RDt-KIgWTquNUVYrbT70-w}{WP005881.bskyb.com}{10.137.48.60:9300}{ml.machine_memory=17179332608, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[WP005881-SCOLO-NODE3][10.137.48.60:9300][internal:discovery/zen/join]]; nested: IllegalStateException[failure when sending a validation request to node]; nested: RemoteTransportException[[WP005885-ONYX-NODE1][10.105.50.53:9300][internal:discovery/zen/join/validate]]; nested: IOException[Invalid string; unexpected character: 255 hex: ff]; ]
[2019-03-28T13:18:32,443][WARN ][o.e.n.Node ] [WP005885-ONYX-NODE1] timed out while waiting for initial discovery state - timeout: 30s
[2019-03

IOException[Invalid string; unexpected character: 255 hex: ff] seems to be the sticking point just now

Ok, more progress, but this is a very strange error that suggests a bug. Unfortunately the message isn't very helpful - I think we need to see the whole internal:discovery/zen/join/validate message to work out what's going wrong. Can you add the following to the upgraded node's elasticsearch.yml and restart it?

logger.org.elasticsearch.transport.netty4.ESLoggingHandler: TRACE
logger.org.elasticsearch.transport.TransportLogger: TRACE

This'll log a hex dump of every message. Once you've seen another one of these IOExceptions you can shut the node down and remove those logging lines again. Then please could you share the log, or at least the dump of the internal:discovery/zen/join/validate message? It will contain a lot of detail about your cluster, so you might prefer to send it to me directly at david.turner@elastic.co instead of sharing it publicly.

Hi Dave
Have emailed you the logs; hopefully they got through both email filters.

So I managed to get the node up.
I removed all our ML indices (as we weren't really using ML in anger) & that allowed the node to start & join the cluster OK.

Great, thanks for letting us know. I've seen a couple of other reports of this, and it's useful to know that it might be to do with ML. We'll get to the bottom of it eventually.

I suspect this is the cause: https://github.com/elastic/elasticsearch/pull/40610


If this is the source of your troubles, you can work around this by upgrading to 6.6.2, and then upgrading to 6.7.0.

Thanks Jason, getting the error again on a different node, so will see if this works just now.

@DavidTurner sent you the Wireshark traces from the latest instance just in case they are useful.

@jasontedor
Sadly that workaround didn't fix the issue for us; still getting the same error on startup.

Via another channel, I learned that the cluster was successfully upgraded to 6.6.2; one node is upgraded to 6.7.0 but the second node is struggling:

still getting the same failed to send join request on the 2nd node

Just to check @John_Swift, this message still ends in nested: IOException[Invalid string; unexpected character: 255 hex: ff]; ], correct? To be sure that we're really talking about the same thing could you share the full message and the stack trace? Can you also share GET /_cat/nodes?v&h=id,ip,port,v,m to demonstrate that this is now a mixed 6.6.2/6.7.0 cluster?

@DavidTurner
Not sure how the 6.5.1 node is still there; the Windows service that is running shows 6.7.0.

id ip port v m
JqqW 10.105.158.40 9300 6.6.2 -
9uRH 10.137.48.60 9300 6.6.2 -
VzRI 10.105.50.53 9300 6.5.1 *

Let me uninstall ES from that box & try again...

Ah good, that'd explain it. The faulty message is coming from the master, which is still pre-6.6.2.

Hi,

We're also seeing the same issue - we had a 3-node cluster on 6.3.0 and upgraded a node to 6.7.0.

That node is no longer able to join the cluster, and posts the same error:

[2019-03-29T14:39:45,776][INFO ][o.e.d.z.ZenDiscovery ] [HOSTNAME-elastic-5] failed to send join request to master [{HOSTNAME-elastic-3}{-12c-0gzRJqY0dhxkbNPng}{5Ipsr27sSeiri9EuAPE9FQ}{IPADDRESS-elastic-3}{IPADDRESS-elastic-3:9300}{ml.machine_memory=67384311808, rack=LOCATION, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[HOSTNAME-elastic-3][IPADDRESS-elastic-3:9300][internal:discovery/zen/join]]; nested: IllegalStateException[failure when sending a validation request to node]; nested: RemoteTransportException[[HOSTNAME-elastic-5][IPADDRESS-elastic-5:9300][internal:discovery/zen/join/validate]]; nested: IOException[Invalid string; unexpected character: 255 hex: ff]; ]

& like magic, as soon as that node stopped, the 6.7 one joined the cluster...

id ip port v m
JqqW 10.105.158.40 9300 6.6.2 -
9uRH 10.137.48.60 9300 6.6.2 *
gUX9 10.137.54.61 9300 6.7.0 -

Still not quite sure how the 6.5 version was running via the 6.7 Windows service, but it's looking much better now.


Hi @smbnj, as mentioned above, for now you should upgrade your whole cluster to 6.6.2 first and thence on to 6.7.0.