Upgraded node not rejoining non-upgraded cluster after 6.5.1 to 6.7 upgrade - IOException[Invalid string; unexpected character: 255 hex: ff]

Hi @DavidTurner - will bringing down one of the 2 remaining nodes in order to upgrade it to 6.6.2 not result in the cluster failing and loss of data?

@DavidTurner we are now fully upgraded !!
gUX9 10.137.54.61 9300 6.7.0 - WP005882-SCOLO-NODE4
9uRH 10.137.48.60 9300 6.7.0 - WP005881-SCOLO-NODE3
VzRI 10.105.50.53 9300 6.7.0 - WP005885-ONYX-NODE1
JqqW 10.105.158.40 9300 6.7.0 * WP005878-ONYX-NODE2

Thanks for your and @jasontedor's help on this.
In hindsight, I don't think the Machine Learning thing was related at all. It looks like when I started that node back up, the 6.5.1 version of the app was the one that started, not 6.7.

Just to summarise, in case anyone else is having the same issue:

A node upgraded from 6.5.1 to 6.7.0 failed to rejoin the cluster with the error: failed to send join request to master IOException[Invalid string; unexpected character: 255 hex: ff]; ]

The workaround is to upgrade the entire cluster to 6.6.2 first.

Use GET _cat/nodes?v&h=id,ip,port,v,m,n to verify that all nodes are at 6.6.2.

Then upgrade to 6.7.
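
For anyone who would rather script the version check than eyeball it, here is a minimal Python sketch. It assumes you have already fetched the plain-text response of GET _cat/nodes?h=id,ip,port,v,m,n (without the v parameter, so there is no header row) by whatever means you prefer. The function names are my own and are not part of any Elasticsearch client.

```python
def parse_cat_nodes(text):
    """Parse headerless `_cat/nodes?h=id,ip,port,v,m,n` output into dicts."""
    nodes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Six columns: id, ip, port, version, master marker, node name.
        # maxsplit=5 keeps a node name containing spaces in one piece.
        node_id, ip, port, version, master, name = line.split(None, 5)
        nodes.append({
            "id": node_id,
            "ip": ip,
            "port": int(port),
            "version": version,
            "master": master == "*",  # "*" marks the elected master
            "name": name,
        })
    return nodes


def nodes_not_on(nodes, target_version):
    """Return the names of nodes whose version differs from target_version."""
    return [n["name"] for n in nodes if n["version"] != target_version]
```

An empty list from nodes_not_on(parse_cat_nodes(output), "6.6.2") would suggest every node reports 6.6.2 and the second upgrade step can begin.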

I did also encounter a couple of other issues that were easy enough to resolve:

  1. The service doesn’t start if the directory %CONFIG_DIR%/ingest-geoip exists. I just renamed it to ingest-geoip_old and then started the service.

  2. The MSI installer continually failed to install 6.6.2. I uninstalled 6.5.1 using remove programs and installed 6.6.2 from new. Our data and config dirs existed outside of the program install location, so they weren’t removed during the uninstall. If they had been inside it, they would have been removed, so back them up first if you haven’t already.


It's a risk, yes. The cluster will be unavailable while you're upgrading that node, but should return to health once the node restarts. If you can't tolerate that downtime, the safest path is something like this:

  • temporarily start another empty 6.6.2 node; wait for the cluster health to be green.
  • upgrade the other nodes to 6.6.2
  • bring the already-upgraded 6.7.0 node up
  • decommission the temporary node by adding shard allocation filters to move any data onto the other nodes; wait for this process to complete
  • shut down the temporary node
  • upgrade the other two 6.6.2 nodes.
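
The decommissioning step above relies on shard allocation filtering. As a rough sketch only, the Python below builds the request body for PUT _cluster/settings using the cluster.routing.allocation.exclude._name setting, and checks a GET _cat/allocation?format=json response to see whether the temporary node still holds shards. The node name "temp-node" and the function names are hypothetical, and actually sending the requests is deliberately left out.

```python
import json

EXCLUDE_SETTING = "cluster.routing.allocation.exclude._name"


def exclude_node_settings(node_name):
    """Body for PUT _cluster/settings that moves shards off node_name."""
    return {"transient": {EXCLUDE_SETTING: node_name}}


def clear_exclusion_settings():
    """Body that resets the exclusion; a JSON null removes the setting."""
    return {"transient": {EXCLUDE_SETTING: None}}


def is_drained(allocation_rows, node_name):
    """True if no row for node_name reports any shards.

    allocation_rows is the parsed JSON list from
    GET _cat/allocation?format=json, where "shards" is a string count.
    """
    for row in allocation_rows:
        if row.get("node") == node_name and int(row.get("shards") or 0) > 0:
            return False
    return True


if __name__ == "__main__":
    # "temp-node" is a placeholder for the temporary node's name.
    print(json.dumps(exclude_node_settings("temp-node")))
```

Once is_drained reports True for the temporary node, it should be safe to shut it down; sending the body from clear_exclusion_settings() afterwards removes the leftover filter.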

@John_Swift I am not sure I would rule out the issue being related to Machine Learning yet.

We upgraded our cluster from 6.4.3 to 6.7.0. When the first node started up, the IOException[Invalid string; unexpected character: 255 hex: ff] message was being logged. As the only jobs we had were test ones, we were able to remove them. Once the jobs were removed, the node was able to rejoin the cluster without issue.

That's right @bmagistro, this is caused by a problem in how ML data feeds are transferred over the network, fixed in #40610.

Fortunately I was able to rpm downgrade 6.7 to 6.6.2 and index compatibility was maintained. This allowed us to perform a rolling upgrade to 6.6.2 and then 6.7 without losing cluster resilience.

I'm glad to hear that this worked for you, but I should point out for the benefit of other readers that downgrading a node is very much unsupported and can result in a very broken cluster. The supported path forward is to start new 6.6.2 nodes.
