Node couldn't rejoin cluster because "version not supported"

I've got an Elasticsearch cluster of four nodes. Three can hold data and are master eligible. The fourth cannot hold data, isn't master eligible, and has Kibana on it. Two of the master-eligible nodes became unhappy, resulting in a brief period when there was no cluster. The cluster sorted itself out and went back to green, but the Kibana node (agdud) got left out. In the Elasticsearch log on agdud there are lots of instances of this:

[2021-02-24T14:42:57,977][INFO ][o.e.c.c.JoinHelper       ] [agdud] failed to join {agdub}{gFWyyiOTSl-OVNky9YQmZw}{fuXPwsw-RIy7Nv6BEBIcMw}{10.70.13.42}{10.70.13.42:9300}{cdhilmrstw}{ml.machine_memory=1927577600, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true} with JoinRequest{sourceNode={agdud}{cVGPTdpeRneDaBqu5aCDbw}{VP8A1k8LTziq2-tuR_I3zQ}{10.70.13.78}{10.70.13.78:9300}{ilr}{ml.machine_memory=6087639040, xpack.installed=true, transform.node=false, ml.max_open_jobs=20}, minimumTerm=41, optionalJoin=Optional[Join{term=41, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={agdud}{cVGPTdpeRneDaBqu5aCDbw}{VP8A1k8LTziq2-tuR_I3zQ}{10.70.13.78}{10.70.13.78:9300}{ilr}{ml.machine_memory=6087639040, xpack.installed=true, transform.node=false, ml.max_open_jobs=20}, targetNode={agdub}{gFWyyiOTSl-OVNky9YQmZw}{fuXPwsw-RIy7Nv6BEBIcMw}{10.70.13.42}{10.70.13.42:9300}{cdhilmrstw}{ml.machine_memory=1927577600, ml.max_open_jobs=20, xpack.installed=true, ml.max_jvm_size=1073741824, transform.node=true}}]}
org.elasticsearch.transport.RemoteTransportException: [agdub][10.70.13.42:9300][internal:cluster/coordination/join]
Caused by: java.lang.IllegalStateException: index [ilm-history-3-000004/vpxF3Wm2QeePoBbqNtHHCA] version not supported: 7.11.0 the node version is: 7.10.2

agdub is the master node. The index mentioned is always ilm-history-3-000004.
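
For reference, here is a minimal sketch of how the version skew can be spotted from the cluster side. It assumes an unsecured cluster reachable on localhost:9200 (adjust host and auth for your setup), and note that _cat/nodes only lists nodes currently in the cluster, so a node that has failed to join (agdud here) still has to be checked on its own host:

```python
import json
import urllib.request

# _cat/nodes reports each node's name, version, compact role string and whether
# it is the elected master. Nodes that failed to join will not appear here.
CAT_NODES = ("http://localhost:9200/_cat/nodes"
             "?format=json&h=name,version,node.role,master")

with urllib.request.urlopen(CAT_NODES) as resp:
    nodes = json.load(resp)

for node in nodes:
    # "master" is "*" for the elected master and "-" otherwise;
    # "node.role" is the compact role string (e.g. "cdhilmrstw", or "ilr").
    print(f"{node['name']:10} version={node['version']:8} "
          f"roles={node['node.role']:12} master={node['master']}")

versions = {node["version"] for node in nodes}
if len(versions) > 1:
    print(f"WARNING: mixed versions in the cluster: {sorted(versions)}")
```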

All the data nodes were running Elasticsearch 7.11.0, but agdud was still running 7.10.2. Updating it to 7.11.0 solved the problem. (It would have been updated to 7.11.0 automatically about 12 hours later.) The ilm-history-3-000004 index was created after all the data nodes had been updated to 7.11.0.
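
Here is a minimal sketch (same assumption of an unsecured cluster on localhost:9200) of how to confirm when the index was created and which version created it. The index.version.created setting holds an internal version id, e.g. 7110099 for an index created by a 7.11.0 node:

```python
import json
import urllib.request
from datetime import datetime, timezone

INDEX = "ilm-history-3-000004"
URL = f"http://localhost:9200/{INDEX}/_settings"

with urllib.request.urlopen(URL) as resp:
    index_settings = json.load(resp)[INDEX]["settings"]["index"]

# creation_date is epoch milliseconds; version.created is an internal id
# (e.g. "7110099" means the index was created by a 7.11.0 node).
created_at = datetime.fromtimestamp(
    int(index_settings["creation_date"]) / 1000, tz=timezone.utc)

print(f"index:           {INDEX}")
print(f"created at:      {created_at.isoformat()}")
print(f"version.created: {index_settings['version']['created']}")
```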

Can someone explain why agdud wasn't allowed to join the cluster while running 7.10.2? It seems to be because the ilm-history-3-000004 index was created when the master node and all the data nodes were running 7.11.0, yet agdud doesn't hold data, and it apparently wasn't a problem that it was running a slightly older version at the time the index was created.

Yes, the only legitimate reason for having a mix of versions in your cluster is that you are in the middle of a rolling upgrade, and the rolling upgrade docs answer your questions:

Running multiple versions of Elasticsearch in the same cluster beyond the duration of an upgrade is not supported …

and in the IMPORTANT bit at the bottom:

In the unlikely case of a network malfunction during the upgrade process that isolates all remaining old nodes from the cluster, you must take the old nodes offline and upgrade them to enable them to join the cluster.
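
Once the old node has been taken offline and upgraded, something like this sketch (assuming an unsecured four-node cluster on localhost:9200) can confirm it has rejoined and the cluster is back to green:

```python
import json
import urllib.error
import urllib.request

# Block for up to 60s until the cluster reports green with all four nodes present.
HEALTH = ("http://localhost:9200/_cluster/health"
          "?wait_for_status=green&wait_for_nodes=4&timeout=60s")

try:
    with urllib.request.urlopen(HEALTH) as resp:
        health = json.load(resp)
except urllib.error.HTTPError as err:
    # If the wait times out, some versions answer with an HTTP error status;
    # the body still contains the current health, so read it instead of bailing.
    health = json.load(err)

print(f"status:    {health['status']}")
print(f"nodes:     {health['number_of_nodes']}")
print(f"timed_out: {health['timed_out']}")  # True means the wait expired first
```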

Really useful, thanks. I hadn't thought of it as a rolling upgrade scenario, but that's effectively what was happening.
