Hi, thanks for your questions. In an effort to reproduce the issue and gather more logs, I gave the rolling upgrade procedure another try, manually creating a fresh cluster with 3 master nodes and 3 data nodes. In short, the rolling upgrade method of upgrading the data nodes first and then rolling the master nodes worked.
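For reference, the per-node loop we followed in the successful run looks roughly like the sketch below. This is a minimal sketch, not our actual tooling: the endpoint URL, node names, and the `restart_node` step are placeholders for our Kubernetes setup, while the allocation-toggle and health-check calls are the standard rolling-restart APIs.

```python
import time
import requests

ES = "http://localhost:9200"  # placeholder coordinating endpoint

def restart_node(node_name):
    # Placeholder: in our setup this is done out of band (e.g. a pod delete).
    pass

def roll_node(node_name):
    # Disable replica allocation before taking the node down.
    requests.put(f"{ES}/_cluster/settings", json={
        "persistent": {"cluster.routing.allocation.enable": "primaries"}
    }).raise_for_status()

    restart_node(node_name)

    # Wait for the node to rejoin and the cluster to go green.
    while requests.get(f"{ES}/_cluster/health").json()["status"] != "green":
        time.sleep(10)

    # Re-enable allocation (null resets the setting) before the next node.
    requests.put(f"{ES}/_cluster/settings", json={
        "persistent": {"cluster.routing.allocation.enable": None}
    }).raise_for_status()

# Data nodes first, then the masters one at a time.
for node in ["es-data-0", "es-data-1", "es-data-2",
             "es-master-pure-0", "es-master-pure-1", "es-master-pure-2"]:
    roll_node(node)
```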
I was able to trace the issue we faced before to an auto-deploy CI step that upgraded only one master node to the default distribution. We hit the issue reported above with the following steps:
1. Only one master node was upgraded to the default distribution while the other 2 master nodes remained on OSS. The cluster remained green and stable. This step was triggered automatically by a new Docker image build with the default distribution.
2. The data nodes were upgraded to the default distribution. The upgrade completed and the cluster remained green.
3. A rolling update of the master nodes was triggered. This step caused havoc and the cluster was lost.
Some logs from each master node during step 3:
es-master-pure-2 (the first master node, already on the default distribution):
```
master node changed {previous [{es-master-pure-1}{FkKNvBXATUy7rs5wH64yGQ}{bLAEwN00R46grZzEIG3pxQ}{10.2.12.88}{10.2.12.88:9300}{m}{zone=eu-central-1b,eu-central-1b-fake, group=master}], current []}, term: 1, version: 16, reason: becoming candidate: onLeaderFailure
waiting for elected master node [null] to setup local exporter [default_local] (does it have x-pack installed?)
elected-as-master ([2] nodes joined)[{es-master-pure-0}{dSFqwog6T1SaBOacZU7e-A}{CSogFY3qQ7Wh80gcX9afzQ}{10.2.29.7}{10.2.29.7:9300}{m}{zone=eu-central-1a,eu-central-1a-fake, group=master} elect leader, {es-master-pure-2}{PPR8fyBVSGmvjy_d9h-5Ew}{deicZNgqR0GNvyBGKLUoLA}{10.2.31.16}{10.2.31.16:9300}{lm}{ml.machine_memory=2097152000, xpack.installed=true, zone=eu-central-1c,eu-central-1c-fake, ml.max_open_jobs=20, group=master} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 2, version: 17, reason: master node changed {previous [], current [{es-master-pure-2}{PPR8fyBVSGmvjy_d9h-5Ew}{deicZNgqR0GNvyBGKLUoLA}{10.2.31.16}{10.2.31.16:9300}{lm}{ml.machine_memory=2097152000, xpack.installed=true, zone=eu-central-1c,eu-central-1c-fake, ml.max_open_jobs=20, group=master}]}
master node changed {previous [], current [{es-master-pure-2}{PPR8fyBVSGmvjy_d9h-5Ew}{deicZNgqR0GNvyBGKLUoLA}{10.2.31.16}{10.2.31.16:9300}{lm}{ml.machine_memory=2097152000, xpack.installed=true, zone=eu-central-1c,eu-central-1c-fake, ml.max_open_jobs=20, group=master}]}, term: 2, version: 17, reason: Publication{term=2, version=17}
Starting template upgrade to version 7.4.1, 8 templates will be updated and 0 will be removed
node-left[{es-master-pure-1}{FkKNvBXATUy7rs5wH64yGQ}{bLAEwN00R46grZzEIG3pxQ}{10.2.12.88}{10.2.12.88:9300}{m}{zone=eu-central-1b,eu-central-1b-fake, group=master} disconnected], term: 2, version: 18, reason: removed {{es-master-pure-1}{FkKNvBXATUy7rs5wH64yGQ}{bLAEwN00R46grZzEIG3pxQ}{10.2.12.88}{10.2.12.88:9300}{m}{zone=eu-central-1b,eu-central-1b-fake, group=master},}
Templates were upgraded successfully to version 7.4.1
failing [put-lifecycle-watch-history-ilm-policy]: failed to commit cluster state version [36], "cluster.uuid": "J5PcdLRGTI2X1DRjIsdKkA", "node.id": "PPR8fyBVSGmvjy_d9h-5Ew"
```
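Note that the mixed state is visible right in these lines: es-master-pure-2 carries `xpack.installed=true` as a node attribute while the OSS masters do not. A quick sanity check before rolling the masters would be to confirm every node reports the same distribution. A minimal sketch, assuming a reachable HTTP endpoint (the `_nodes` API and `filter_path` are standard; the hostname is a placeholder):

```python
import requests

ES = "http://localhost:9200"  # placeholder endpoint

# filter_path trims the _nodes response down to each node's name and attributes.
nodes = requests.get(f"{ES}/_nodes", params={
    "filter_path": "nodes.*.name,nodes.*.attributes"
}).json()["nodes"]

for node_id, info in nodes.items():
    xpack = info.get("attributes", {}).get("xpack.installed", "not set (OSS?)")
    print(f"{info['name']}: xpack.installed={xpack}")
```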
es-master-pure-1 (the node currently being restarted for the upgrade):
```
failed to join {es-master-pure-2}{PPR8fyBVSGmvjy_d9h-5Ew}{deicZNgqR0GNvyBGKLUoLA}{10.2.31.16}{10.2.31.16:9300}{lm}{ml.machine_memory=2097152000, ml.max_open_jobs=20, xpack.installed=true, zone=eu-central-1c,eu-central-1c-fake, group=master} with JoinRequest{sourceNode={es-master-pure-1}{9ItooNM0T1WN0C-iDQ6wXA}{ZEGDcPHvQw-REy-uvI812g}{10.2.3.16}{10.2.3.16:9300}{lm}{ml.machine_memory=2097152000, xpack.installed=true, zone=eu-central-1b,eu-central-1b-fake, ml.max_open_jobs=20, group=master}, optionalJoin=Optional[Join{term=281, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={es-master-pure-1}{9ItooNM0T1WN0C-iDQ6wXA}{ZEGDcPHvQw-REy-uvI812g}{10.2.3.16}{10.2.3.16:9300}{lm}{ml.machine_memory=2097152000, xpack.installed=true, zone=eu-central-1b,eu-central-1b-fake, ml.max_open_jobs=20, group=master}, targetNode={es-master-pure-2}{PPR8fyBVSGmvjy_d9h-5Ew}{deicZNgqR0GNvyBGKLUoLA}{10.2.31.16}{10.2.31.16:9300}{lm}{ml.machine_memory=2097152000, ml.max_open_jobs=20, xpack.installed=true, zone=eu-central-1c,eu-central-1c-fake, group=master}}]}
```
es-master-pure-0 (the node still on OSS, pending upgrade):
```
master node changed {previous [{es-master-pure-1}{FkKNvBXATUy7rs5wH64yGQ}{bLAEwN00R46grZzEIG3pxQ}{10.2.12.88}{10.2.12.88:9300}{m}{zone=eu-central-1b,eu-central-1b-fake, group=master}], current []}, term: 1, version: 16, reason: becoming candidate: onLeaderFailure
master node changed {previous [], current [{es-master-pure-2}{PPR8fyBVSGmvjy_d9h-5Ew}{deicZNgqR0GNvyBGKLUoLA}{10.2.31.16}{10.2.31.16:9300}{lm}{ml.machine_memory=2097152000, ml.max_open_jobs=20, xpack.installed=true, zone=eu-central-1c,eu-central-1c-fake, group=master}]}, term: 2, version: 17, reason: ApplyCommitRequest{term=2, version=17, sourceNode={es-master-pure-2}{PPR8fyBVSGmvjy_d9h-5Ew}{deicZNgqR0GNvyBGKLUoLA}{10.2.31.16}{10.2.31.16:9300}{lm}{ml.machine_memory=2097152000, ml.max_open_jobs=20, xpack.installed=true, zone=eu-central-1c,eu-central-1c-fake, group=master}}
removed {{es-master-pure-1}{FkKNvBXATUy7rs5wH64yGQ}{bLAEwN00R46grZzEIG3pxQ}{10.2.12.88}{10.2.12.88:9300}{m}{zone=eu-central-1b,eu-central-1b-fake, group=master},}, term: 2, version: 18, reason: ApplyCommitRequest{term=2, version=18, sourceNode={es-master-pure-2}{PPR8fyBVSGmvjy_d9h-5Ew}{deicZNgqR0GNvyBGKLUoLA}{10.2.31.16}{10.2.31.16:9300}{lm}{ml.machine_memory=2097152000, ml.max_open_jobs=20, xpack.installed=true, zone=eu-central-1c,eu-central-1c-fake, group=master}}
unexpected error while deserializing an incoming cluster state, "cluster.uuid": "J5PcdLRGTI2X1DRjIsdKkA", "node.id": "dSFqwog6T1SaBOacZU7e-A"
java.lang.IllegalArgumentException: Unknown NamedWriteable [org.elasticsearch.cluster.metadata.MetaData$Custom][index_lifecycle]
failed to join {es-master-pure-2}{PPR8fyBVSGmvjy_d9h-5Ew}{deicZNgqR0GNvyBGKLUoLA}{10.2.31.16}{10.2.31.16:9300}{lm}{ml.machine_memory=2097152000, ml.max_open_jobs=20, xpack.installed=true, zone=eu-central-1c,eu-central-1c-fake, group=master} with JoinRequest{sourceNode={es-master-pure-0}{dSFqwog6T1SaBOacZU7e-A}{CSogFY3qQ7Wh80gcX9afzQ}{10.2.29.7}{10.2.29.7:9300}{m}{zone=eu-central-1a,eu-central-1a-fake, group=master}, optionalJoin=Optional[Join{term=3, lastAcceptedTerm=2, lastAcceptedVersion=35, sourceNode={es-master-pure-0}{dSFqwog6T1SaBOacZU7e-A}{CSogFY3qQ7Wh80gcX9afzQ}{10.2.29.7}{10.2.29.7:9300}{m}{zone=eu-central-1a,eu-central-1a-fake, group=master}, targetNode={es-master-pure-2}{PPR8fyBVSGmvjy_d9h-5Ew}{deicZNgqR0GNvyBGKLUoLA}{10.2.31.16}{10.2.31.16:9300}{lm}{ml.machine_memory=2097152000, ml.max_open_jobs=20, xpack.installed=true, zone=eu-central-1c,eu-central-1c-fake, group=master}}]}
```
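The `Unknown NamedWriteable [...][index_lifecycle]` error is consistent with the mixed-distribution setup: the default-distribution master publishes x-pack customs (here the ILM metadata) in the cluster state, and the OSS node has no deserializer registered under that name, so it cannot apply the state and drops out. A simplified analogue of the registry mechanism (illustrative only, a toy stand-in rather than Elasticsearch's actual classes):

```python
# Toy stand-in for Elasticsearch's named-writeable registry. Each node can only
# deserialize cluster-state customs whose names its installed modules registered;
# OSS nodes never register "index_lifecycle".
class NamedWriteableRegistry:
    def __init__(self, readers):
        self.readers = readers  # custom name -> deserializer

    def get_reader(self, name):
        if name not in self.readers:
            # Mirrors the IllegalArgumentException seen in the log above.
            raise ValueError(f"Unknown NamedWriteable [{name}]")
        return self.readers[name]

oss_node = NamedWriteableRegistry({"repositories": bytes})        # core customs only
default_node = NamedWriteableRegistry({"repositories": bytes,
                                       "index_lifecycle": bytes})  # x-pack adds ILM

default_node.get_reader("index_lifecycle")  # fine on the default distribution
oss_node.get_reader("index_lifecycle")      # raises: the OSS master can't apply the state
```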
I've excluded the full stack traces for brevity. Let me know if more information is needed.