Upgrade from 6.8.23 to 7.3.2 fails with "master not discovered yet"

I am testing an upgrade of our cluster from 6.8.23 to 7.3.2, but the ingest nodes fail the upgrade; both the master and data nodes work fine. As per the documentation, this upgrade only needs a rolling restart and doesn't require cluster.initial_master_nodes.

Can I get help as to why only the ingest nodes fail?

Logs from the ingest node:

[2024-07-16T22:56:22,961][INFO ][o.e.d.DiscoveryModule    ] [usw2b-9297126f-es-timeline-ingest-dev] using discovery type [zen] and seed hosts providers [settings, ec2]
[2024-07-16T22:56:38,516][INFO ][o.e.n.Node               ] [usw2b-9297126f-es-timeline-ingest-dev] initialized
[2024-07-16T22:56:38,516][INFO ][o.e.n.Node               ] [usw2b-9297126f-es-timeline-ingest-dev] starting ...
[2024-07-16T22:56:41,139][INFO ][o.e.t.TransportService   ] [usw2b-9297126f-es-timeline-ingest-dev] publish_address {10.27.5.217:9300}, bound_addresses {[::]:9300}
[2024-07-16T22:56:41,250][INFO ][o.e.b.BootstrapChecks    ] [usw2b-9297126f-es-timeline-ingest-dev] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2024-07-16T22:56:51,608][WARN ][o.e.c.c.ClusterFormationFailureHelper] [usw2b-9297126f-es-timeline-ingest-dev] master not discovered yet: have discovered [{usw2b-9297126f-es-timeline-ingest-dev}{SeAMikyqRk6Ex10E3j6d8Q}{9v6Fp1tsSXyd8daThv-Png}{10.27.5.217}{10.27.5.217:9300}{i}{aws_availability_zone=us-west-2b, ml.machine_memory=33230782464, xpack.installed=true, ml.max_open_jobs=20}]; discovery will continue using [] from hosts providers and [] from last-known cluster state; node term 0, last-accepted version 0 in term 0
[2024-07-16T22:56:55,663][INFO ][o.e.c.s.ClusterApplierService] [usw2b-9297126f-es-timeline-ingest-dev] master node changed {previous [], current [{usw2a-c5deb26a-es-timeline-master-dev}{EBFHDajgSVytFUsFjfnaYg}{Pvxz2w-pRnSlTHZZntR89w}{10.27.4.207}{10.27.4.207:9300}{m}{aws_availability_zone=us-west-2a, ml.machine_memory=16524873728, ml.max_open_jobs=20, xpack.installed=true}]}, added {{usw2a-716b708b-es-timeline-data-dev}{tJrjr55-T1WaQLqwuZb84A}{j0Q9RmlhTIu2NZyIk-MX3g}{10.27.4.56}{10.27.4.56:9300}{d}{aws_availability_zone=us-west-2a, ml.machine_memory=66700505088, ml.max_open_jobs=20, xpack.installed=true},{usw2c-ddb57b6f-es-timeline-ingest-dev}{CfXvYYg7S5ic3Au2pW0f5w}{-pWtDSVfRsS3KGZ36EhBKg}{10.27.6.29}{10.27.6.29:9300}{i}{aws_availability_zone=us-west-2c, ml.machine_memory=33230782464, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{usw2b-902bb083-es-timeline-master-dev}{L1SzWfdXTb2UK4EyzIQAzA}{KmQdmrcYQUy5_yplLJf_pg}{10.27.5.66}{10.27.5.66:9300}{m}{aws_availability_zone=us-west-2b, ml.machine_memory=16524873728, ml.max_open_jobs=20, xpack.installed=true},{usw2c-d54260f0-es-timeline-master-dev}{qNIX5n4zRkuXEmJ0GmYBRA}{IKoYgx4-Q2yFCnf1JD-U0A}{10.27.6.175}{10.27.6.175:9300}{m}{aws_availability_zone=us-west-2c, ml.machine_memory=16524873728, ml.max_open_jobs=20, xpack.installed=true},{usw2a-4509bcbc-es-timeline-ingest-dev}{MfYpaU2MQO2DpsE7ltOFDg}{RbAcGiH9Q8GHvaOiN3bqzg}{10.27.4.110}{10.27.4.110:9300}{i}{aws_availability_zone=us-west-2a, ml.machine_memory=33230782464, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},{usw2a-c5deb26a-es-timeline-master-dev}{EBFHDajgSVytFUsFjfnaYg}{Pvxz2w-pRnSlTHZZntR89w}{10.27.4.207}{10.27.4.207:9300}{m}{aws_availability_zone=us-west-2a, ml.machine_memory=16524873728, ml.max_open_jobs=20, xpack.installed=true},{usw2c-3da4f261-es-timeline-data-dev}{JSU-ySanSlOsN96HS9G-ag}{DIL3wjGJQj6DztELG1WXgw}{10.27.6.236}{10.27.6.236:9300}{d}{aws_availability_zone=us-west-2c, ml.machine_memory=66700505088, ml.max_open_jobs=20, xpack.installed=true},{usw2b-a796520d-es-timeline-data-dev}{OsM6-5PlTtmXUvZp9dwb-A}{b6xnZ29-TaKp7QU92IMwYg}{10.27.5.250}{10.27.5.250:9300}{d}{aws_availability_zone=us-west-2b, ml.machine_memory=66700505088, ml.max_open_jobs=20, xpack.installed=true},}, term: 1, version: 82, reason: ApplyCommitRequest{term=1, version=82, sourceNode={usw2a-c5deb26a-es-timeline-master-dev}{EBFHDajgSVytFUsFjfnaYg}{Pvxz2w-pRnSlTHZZntR89w}{10.27.4.207}{10.27.4.207:9300}{m}{aws_availability_zone=us-west-2a, ml.machine_memory=16524873728, ml.max_open_jobs=20, xpack.installed=true}}
[2024-07-16T22:56:56,152][INFO ][o.e.x.s.a.TokenService   ] [usw2b-9297126f-es-timeline-ingest-dev] refresh keys
[2024-07-16T22:56:56,364][INFO ][o.e.x.s.a.TokenService   ] [usw2b-9297126f-es-timeline-ingest-dev] refreshed keys
[2024-07-16T22:56:56,724][INFO ][o.e.l.LicenseService     ] [usw2b-9297126f-es-timeline-ingest-dev] license [a1c99395-300c-4651-99ad-c40f24ac5cdc] mode [basic] - valid
[2024-07-16T22:56:56,725][INFO ][o.e.x.s.s.SecurityStatusChangeListener] [usw2b-9297126f-es-timeline-ingest-dev] Active license is now [BASIC]; Security is enabled
[2024-07-16T22:56:56,843][INFO ][o.e.h.AbstractHttpServerTransport] [usw2b-9297126f-es-timeline-ingest-dev] publish_address {10.27.5.217:9200}, bound_addresses {[::]:9200}
[2024-07-16T22:56:56,843][INFO ][o.e.n.Node               ] [usw2b-9297126f-es-timeline-ingest-dev] started
[2024-07-16T22:57:28,778][WARN ][o.e.c.l.LogConfigurator  ] [usw2b-9297126f-es-timeline-ingest-dev] Some logging configurations have %marker but don't have %node_name. We will automatically add %node_name to the pattern to ease the migration for users who customize log4j2.properties but will stop this behavior in 7.0. You should manually replace `%node_name` with `[%node_name]%marker ` in these locations:
  /mnt/agents/nomad/data/alloc/461f4100-4ce0-d0cf-3120-0bb4dbe4def1/elasticsearch/local/elasticsearch-7.3.2/config/log4j2.properties
[2024-07-16T22:57:32,912][INFO ][o.e.e.NodeEnvironment    ] [usw2b-9297126f-es-timeline-ingest-dev] using [1] data paths, mounts [[/mnt/apps/elasticsearch (/dev/nvme3n1)]], net usable_space [148.8gb], net total_space [149.9gb], types [xfs]
[2024-07-16T22:57:32,913][INFO ][o.e.e.NodeEnvironment    ] [usw2b-9297126f-es-timeline-ingest-dev] heap size [15.2gb], compressed ordinary object pointers [true]
[2024-07-16T22:57:32,918][INFO ][o.e.n.Node               ] [usw2b-9297126f-es-timeline-ingest-dev] node name [usw2b-9297126f-es-timeline-ingest-dev], node ID [SeAMikyqRk6Ex10E3j6d8Q], cluster name [timeline]
[2024-07-16T22:57:32,918][INFO ][o.e.n.Node               ] [usw2b-9297126f-es-timeline-ingest-dev] version[7.3.2], pid[26576], build[default/tar/1c1faf1/2019-09-06T14:40:30.409026Z], OS[Linux/4.14.345-262.561.amzn2.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/12.0.2/12.0.2+10]
[2024-07-16T22:57:32,919][INFO ][o.e.n.Node               ] [usw2b-9297126f-es-timeline-ingest-dev] JVM home [/mnt/agents/nomad/data/alloc/461f4100-4ce0-d0cf-3120-0bb4dbe4def1/elasticsearch/local/elasticsearch-7.3.2/jdk]
[2024-07-16T22:57:32,919][INFO ][o.e.n.Node               ] [usw2b-9297126f-es-timeline-ingest-dev] JVM arguments [-XX:NewRatio=5, -XX:+IgnoreUnrecognizedVMOptions, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+HeapDumpOnOutOfMemoryError, -Dlog4j2.disable.jmx=true, -Dlog4j2.formatMsgNoLookups=true, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -Xloggc:/mnt/agents/nomad/data/alloc/461f4100-4ce0-d0cf-3120-0bb4dbe4def1/elasticsearch/local/logs/gc-cms-%t.log, -XX:+UseGCLogFileRotation, -Xlog::::filecount=10,filesize=50M, -Dnetworkaddress.cache.ttl=60, -Xms15845m, -Xmx15845m, -Djna.tmpdir=/mnt/agents/nomad/data/alloc/461f4100-4ce0-d0cf-3120-0bb4dbe4def1/elasticsearch/local/tmp, -Dio.netty.allocator.type=pooled, -XX:MaxDirectMemorySize=8307867648, -Des.path.home=/mnt/agents/nomad/data/alloc/461f4100-4ce0-d0cf-3120-0bb4dbe4def1/elasticsearch/local/elasticsearch-7.3.2, -Des.path.conf=/mnt/agents/nomad/data/alloc/461f4100-4ce0-d0cf-3120-0bb4dbe4def1/elasticsearch/local/elasticsearch-7.3.2/config, -Des.distribution.flavor=default, -Des.distribution.type=tar, -Des.bundled_jdk=true]
[2024-07-16T22:57:37,333][WARN ][o.e.x.w.Watcher          ] [usw2b-9297126f-es-timeline-ingest-dev] the [action.auto_create_index] setting is configured to be restrictive [.marvel-*,relateiq-*,.security*,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*,logstash*,metricbeat-*,.kibana].  for the next 6 months daily history indices are allowed to be created, but please make sure that any future history indices after 6 months with the pattern [.watcher-history-yyyy.MM.dd] are allowed to be created
[2024-07-16T22:57:48,138][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [aggs-matrix-stats]
[2024-07-16T22:57:48,139][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [analysis-common]
[2024-07-16T22:57:48,139][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [data-frame]
[2024-07-16T22:57:48,139][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [flattened]
[2024-07-16T22:57:48,139][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [ingest-common]
[2024-07-16T22:57:48,139][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [ingest-geoip]
[2024-07-16T22:57:48,139][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [ingest-user-agent]
[2024-07-16T22:57:48,139][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [lang-expression]
[2024-07-16T22:57:48,140][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [lang-mustache]
[2024-07-16T22:57:48,140][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [lang-painless]
[2024-07-16T22:57:48,140][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [mapper-extras]
[2024-07-16T22:57:48,140][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [parent-join]
[2024-07-16T22:57:48,140][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [percolator]
[2024-07-16T22:57:48,140][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [rank-eval]
[2024-07-16T22:57:48,141][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [reindex]
[2024-07-16T22:57:48,141][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [repository-url]
[2024-07-16T22:57:48,141][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [transport-netty4]
[2024-07-16T22:57:48,141][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [vectors]
[2024-07-16T22:57:48,141][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-ccr]
[2024-07-16T22:57:48,141][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-core]
[2024-07-16T22:57:48,142][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-deprecation]
[2024-07-16T22:57:48,142][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-graph]
[2024-07-16T22:57:48,142][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-ilm]
[2024-07-16T22:57:48,142][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-logstash]
[2024-07-16T22:57:48,142][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-ml]
[2024-07-16T22:57:48,142][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-monitoring]
[2024-07-16T22:57:48,142][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-rollup]
[2024-07-16T22:57:48,143][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-security]
[2024-07-16T22:57:48,143][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-sql]
[2024-07-16T22:57:48,143][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-voting-only-node]
[2024-07-16T22:57:48,143][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded module [x-pack-watcher]
[2024-07-16T22:57:48,144][INFO ][o.e.p.PluginsService     ] [usw2b-9297126f-es-timeline-ingest-dev] loaded plugin [discovery-ec2]
[2024-07-16T22:57:53,321][INFO ][i.n.u.i.PlatformDependent] [usw2b-9297126f-es-timeline-ingest-dev] Your platform does not provide complete low-level API for accessing direct buffers reliably. Unless explicitly requested, heap buffer will always be preferred to avoid potential system instability.

This is the config on the ingest node:

cluster.name: timeline
path.data: /mnt/apps/elasticsearch/data
path.logs: /mnt/agents/nomad/data/alloc/461f4100-4ce0-d0cf-3120-0bb4dbe4def1/elasticsearch/local/logs
node.name: usw2b-9297126f-es-timeline-ingest-dev
node.ingest: true
node.data: false
node.master: false
action.auto_create_index: ".marvel-*,relateiq-*,.security*,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*,logstash*,metricbeat-*,.kibana"
network.host: 0.0.0.0

discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 6s
discovery.zen.fd.ping_interval: 15s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5

discovery.zen.hosts_provider: ec2
discovery.ec2.endpoint: ec2.us-west-2.amazonaws.com
discovery.ec2.groups: sg-029f11f45bf46463a
discovery.ec2.tag.ReplicaResourceId: search-timeline-shared-rray


cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone

xpack.security.enabled: true
xpack.security.http.ssl.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate

##transport TLS settings
xpack.security.transport.ssl.key: /mnt/agents/nomad/data/alloc/461f4100-4ce0-d0cf-3120-0bb4dbe4def1/elasticsearch/local/elasticsearch-7.3.2/config/ssl/wildcard_2020_2021.key
xpack.security.transport.ssl.certificate: /mnt/agents/nomad/data/alloc/461f4100-4ce0-d0cf-3120-0bb4dbe4def1/elasticsearch/local/elasticsearch-7.3.2/config/ssl/wildcard_2020_2021.crt
xpack.security.transport.ssl.client_authentication: none

##http TLS settings
xpack.security.http.ssl.key: /mnt/agents/nomad/data/alloc/461f4100-4ce0-d0cf-3120-0bb4dbe4def1/elasticsearch/local/elasticsearch-7.3.2/config/ssl/wildcard_2020_2021.key
xpack.security.http.ssl.certificate: /mnt/agents/nomad/data/alloc/461f4100-4ce0-d0cf-3120-0bb4dbe4def1/elasticsearch/local/elasticsearch-7.3.2/config/ssl/wildcard_2020_2021.crt

xpack.monitoring.collection.enabled: true
xpack.monitoring.elasticsearch.collection.enabled: true # this needs to be set to true if we want monitoring-es indices for legacy monitoring (no metricbeat)
xpack.monitoring.exporters.my_local.type: local
xpack.monitoring.exporters.my_local.use_ingest: false

xpack.security.authc.anonymous.authz_exception: true
xpack.security.authc.anonymous.roles: remote_monitoring_agent

xpack.security.authc.realms.file.file1.order: 0

xpack.security.authc.realms.native.native1.order: 0

reindex.remote.whitelist: ""
thread_pool.write.queue_size: 1000

Thank you!

That's a strange choice. 7.3.2 is almost 5 years old; if you're going to put in the effort to upgrade, you should at least go to 7.17.

You have configured this node to find other nodes by querying the EC2 API for members of the security group sg-029f11f45bf46463a with the tag ReplicaResourceId = search-timeline-shared-rray.

The most likely explanation is that your master node is not in that group and/or it does not have that tag.
What is the equivalent config in your master node's elasticsearch.yml?
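A quick way to see exactly which instances those discovery settings should match is to run the equivalent DescribeInstances query yourself. This is a sketch (it assumes the AWS CLI is installed and the host's instance profile can call ec2:DescribeInstances); the command is only echoed so you can review it first:

```shell
# Values copied from the elasticsearch.yml above; adjust if yours differ.
GROUP="sg-029f11f45bf46463a"
TAG_KEY="ReplicaResourceId"
TAG_VAL="search-timeline-shared-rray"

# Echoed for review; remove the leading `echo` to actually run it.
echo aws ec2 describe-instances --region us-west-2 \
  --filters "Name=instance-state-name,Values=running" \
            "Name=instance.group-id,Values=$GROUP" \
            "Name=tag:$TAG_KEY,Values=$TAG_VAL" \
  --query "Reservations[].Instances[].PrivateIpAddress" \
  --output text
```

Every master-eligible node's private IP should show up in that output; if the masters are missing, this node has no way to discover them.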

We are not yet ready to upgrade our client, so for now we plan to go to 7.3.2 and then to 7.17.

I verified that the master nodes have the tag listed.

Here is the config from a master node:

cluster.name: timeline
path.data: /mnt/apps/elasticsearch/data
path.logs: /mnt/agents/nomad/data/alloc/a41b8a41-6d87-4959-fa01-64325ad8dfd4/elasticsearch/local/logs
node.name: usw2a-c5deb26a-es-timeline-master-dev
node.ingest: false
node.data: false
node.master: true
action.auto_create_index: ".marvel-*,relateiq-*,.security*,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*,logstash*,metricbeat-*,.kibana"
network.host: 0.0.0.0

discovery.zen.minimum_master_nodes: 2
discovery.zen.ping_timeout: 6s
discovery.zen.fd.ping_interval: 15s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5

discovery.zen.hosts_provider: ec2
discovery.ec2.endpoint: ec2.us-west-2.amazonaws.com
discovery.ec2.groups: sg-029f11f45bf46463a
discovery.ec2.tag.ReplicaResourceId: search-timeline-shared-rray


cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone

xpack.security.enabled: true
xpack.security.http.ssl.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate

##transport TLS settings
xpack.security.transport.ssl.key: /mnt/agents/nomad/data/alloc/a41b8a41-6d87-4959-fa01-64325ad8dfd4/elasticsearch/local/elasticsearch-7.3.2/config/ssl/wildcard_2020_2021.key
xpack.security.transport.ssl.certificate: /mnt/agents/nomad/data/alloc/a41b8a41-6d87-4959-fa01-64325ad8dfd4/elasticsearch/local/elasticsearch-7.3.2/config/ssl/wildcard_2020_2021.crt
xpack.security.transport.ssl.client_authentication: none

##http TLS settings
xpack.security.http.ssl.key: /mnt/agents/nomad/data/alloc/a41b8a41-6d87-4959-fa01-64325ad8dfd4/elasticsearch/local/elasticsearch-7.3.2/config/ssl/wildcard_2020_2021.key
xpack.security.http.ssl.certificate: /mnt/agents/nomad/data/alloc/a41b8a41-6d87-4959-fa01-64325ad8dfd4/elasticsearch/local/elasticsearch-7.3.2/config/ssl/wildcard_2020_2021.crt

xpack.monitoring.collection.enabled: true
xpack.monitoring.elasticsearch.collection.enabled: true # this needs to be set to true if we want monitoring-es indices for legacy monitoring (no metricbeat)
xpack.monitoring.exporters.my_local.type: local
xpack.monitoring.exporters.my_local.use_ingest: false

xpack.security.authc.anonymous.authz_exception: true
xpack.security.authc.anonymous.roles: remote_monitoring_agent

xpack.security.authc.realms.file.file1.order: 0

xpack.security.authc.realms.native.native1.order: 0

reindex.remote.whitelist: ""
thread_pool.write.queue_size: 1000

The only reasonable explanation I have is that the machine-user for those ingest nodes does not have permission to see the full membership of that security group (or possibly the tags).

I think you're going to have to verify what is being returned by the EC2 APIs on the ingest hosts.
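Concretely, something like the following on an affected ingest host, then again on a master host, comparing the two (a sketch assuming the AWS CLI is available; the commands are echoed so you can review them before running):

```shell
REGION="us-west-2"
GROUP="sg-029f11f45bf46463a"

# Which IAM role/user is this host calling the EC2 API as?
echo aws sts get-caller-identity

# Does that identity see the masters in the security group?
# Remove the leading `echo` on each command to actually run it.
echo aws ec2 describe-instances --region "$REGION" \
  --filters "Name=instance.group-id,Values=$GROUP" \
  --query "Reservations[].Instances[].PrivateIpAddress" \
  --output text
```

If the ingest host's list is missing the master IPs, or the call fails with an authorization error, while the same commands on a master host return the full list, that points at the instance profile rather than at Elasticsearch.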

If your client works against 7.3 then it will still work against 7.17.

Lots of deprecated/no-op settings there.

In any case, this kind of thing is far easier to troubleshoot in 7.17 than in 7.3.
