Cannot Discover Master Node after upgrade to ES 6.0.0

I did a rolling upgrade of our 3-node ES cluster from 5.6.3 to 6.0.0. Two of the three nodes were able to discover each other, but the third node still cannot discover the master, and the cluster state has been red ever since.

Here are the settings for ES:

cluster.name: "es-at-221b"
network.host: 0.0.0.0
network.publish_host: _ec2:privateIp_
cloud.node.auto_attributes: true
discovery:
    zen:
      hosts_provider: ec2
      minimum_master_nodes: 2
    ec2:
      availability_zones: us-west-2a
      tag.system: es-at-221b-nodes
      host_type: "private_ip"
xpack.security.enabled: false
xpack.monitoring.enabled: true
xpack.ml.enabled: false
xpack.graph.enabled: false
xpack.watcher.enabled: false
bootstrap.memory_lock: false

Running on Amazon Linux (Amazon Linux AMI 2017.09.0.20170930 x86_64 HVM), inside a Docker container with ports 9200 and 9300 exposed and bound to the host.

Here are the logs:

[2017-11-16T22:16:59,063][INFO ][o.e.n.Node               ] [] initializing ...
[2017-11-16T22:16:59,142][INFO ][o.e.e.NodeEnvironment    ] [UwrqR1o] using [1] data paths, mounts [[/usr/share/elasticsearch/data (/dev/xvda1)]], net usable_space [100.0gb], net total_space [100.0gb], types [ext4]
[2017-11-16T22:16:59,142][INFO ][o.e.e.NodeEnvironment    ] [UwrqR1o] heap size [15.9gb], compressed ordinary object pointers [true]
[2017-11-16T22:16:59,144][INFO ][o.e.n.Node               ] node name [UwrqR1o] derived from node ID [UwrqR1onT0K2wTs2IYxA2A]; set [node.name] to override
[2017-11-16T22:16:59,144][INFO ][o.e.n.Node               ] version[6.0.0], pid[1], build[8f0685b/2017-11-10T18:41:22.859Z], OS[Linux/4.9.58-18.55.amzn1.x86_64/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/1.8.0_151/25.151-b12]
[2017-11-16T22:16:59,144][INFO ][o.e.n.Node               ] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -XX:+HeapDumpOnOutOfMemoryError, -Des.cgroups.hierarchy.override=/, -Xms16g, -Xmx16g, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/usr/share/elasticsearch/config]
[2017-11-16T22:17:00,963][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [aggs-matrix-stats]
[2017-11-16T22:17:00,963][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [analysis-common]
[2017-11-16T22:17:00,963][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [ingest-common]
[2017-11-16T22:17:00,964][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [lang-expression]
[2017-11-16T22:17:00,964][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [lang-mustache]
[2017-11-16T22:17:00,964][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [lang-painless]
[2017-11-16T22:17:00,964][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [parent-join]
[2017-11-16T22:17:00,964][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [percolator]
[2017-11-16T22:17:00,964][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [reindex]
[2017-11-16T22:17:00,964][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [repository-url]
[2017-11-16T22:17:00,964][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [transport-netty4]
[2017-11-16T22:17:00,964][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded module [tribe]
[2017-11-16T22:17:00,965][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded plugin [discovery-ec2]
[2017-11-16T22:17:00,965][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded plugin [ingest-geoip]
[2017-11-16T22:17:00,965][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded plugin [ingest-user-agent]
[2017-11-16T22:17:00,965][INFO ][o.e.p.PluginsService     ] [UwrqR1o] loaded plugin [x-pack]
[2017-11-16T22:17:03,245][INFO ][o.e.d.DiscoveryModule    ] [UwrqR1o] using discovery type [zen]
[2017-11-16T22:17:03,955][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2017-11-16T22:17:03,964][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]
[2017-11-16T22:17:04,107][INFO ][o.e.n.Node               ] initialized
[2017-11-16T22:17:04,107][INFO ][o.e.n.Node               ] [UwrqR1o] starting ...
[2017-11-16T22:17:04,242][INFO ][o.e.t.TransportService   ] [UwrqR1o] publish_address {xxx.xx.xx.xxx:9300}, bound_addresses {0.0.0.0:9300}
[2017-11-16T22:17:04,260][INFO ][o.e.b.BootstrapChecks    ] [UwrqR1o] bound or publishing to a non-loopback or non-link-local address, enforcing bootstrap checks
[2017-11-16T22:17:07,944][WARN ][o.e.d.z.ZenDiscovery     ] [UwrqR1o] not enough master nodes discovered during pinging (found [[Candidate{node={UwrqR1o}{UwrqR1onT0K2wTs2IYxA2A}{bQ7yNXVfTiS9kqv7CZrsNQ}{xxx.xx.xx.xxx}{xxx.xx.xx.xxx:9300}{aws_availability_zone=us-west-2a}, clusterStateVersion=-1}]], but needed [2]), pinging again
[2017-11-16T22:17:11,039][WARN ][o.e.d.z.ZenDiscovery     ] [UwrqR1o] not enough master nodes discovered during pinging (found [[Candidate{node={UwrqR1o}{UwrqR1onT0K2wTs2IYxA2A}{bQ7yNXVfTiS9kqv7CZrsNQ}{xxx.xx.xx.xxx}{xxx.xx.xx.xxx:9300}{aws_availability_zone=us-west-2a}, clusterStateVersion=-1}]], but needed [2]), pinging again
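
The two ZenDiscovery warnings above are worth reading closely: the only candidate listed is the node itself (UwrqR1o), which suggests the EC2 hosts provider returned no other hosts. A minimal Python sketch, assuming the log line format shown above (the helper name and regexes are illustrative and not part of Elasticsearch), that pulls the candidate names and the required count out of such a line:

```python
import re

# Illustrative helper, not an Elasticsearch API: summarize a ZenDiscovery
# "not enough master nodes" warning. Assumes the log format shown above.
CANDIDATE_RE = re.compile(r"Candidate\{node=\{([^}]+)\}")
NEEDED_RE = re.compile(r"but needed \[(\d+)\]")

def summarize_ping_warning(line):
    """Return (candidate_node_names, needed), or None if the line does not match."""
    needed = NEEDED_RE.search(line)
    if needed is None:
        return None
    return CANDIDATE_RE.findall(line), int(needed.group(1))

if __name__ == "__main__":
    sample = (
        "[2017-11-16T22:17:07,944][WARN ][o.e.d.z.ZenDiscovery     ] [UwrqR1o] "
        "not enough master nodes discovered during pinging (found "
        "[[Candidate{node={UwrqR1o}{UwrqR1onT0K2wTs2IYxA2A}{bQ7yNXVfTiS9kqv7CZrsNQ}"
        "{xxx.xx.xx.xxx}{xxx.xx.xx.xxx:9300}{aws_availability_zone=us-west-2a}, "
        "clusterStateVersion=-1}]], but needed [2]), pinging again"
    )
    names, needed = summarize_ping_warning(sample)
    print(f"found {len(names)} candidate(s) {names}, needed {needed}")
```

If the list of candidates only ever contains the local node, the problem is likely in host discovery rather than in master election.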

Can anyone point me in the right direction?

Did you update the discovery plugin as well?

Yes, I did. It was a fresh Docker image for 6.0.0, on top of which I ran bin/elasticsearch-plugin install --batch discovery-ec2 to get the latest discovery plugin.

I was able to solve this issue.

Adding endpoint: ec2.us-west-2.amazonaws.com under discovery.ec2 let the nodes discover each other. What is weird is that the other two nodes were able to discover each other without this setting. My guess is that they relied on state saved on the nodes themselves.

cluster.name: "es-at-221b"
network.host: 0.0.0.0
network.publish_host: _ec2:privateIp_
cloud.node.auto_attributes: true
discovery:
    zen:
      hosts_provider: ec2
      minimum_master_nodes: 2
    ec2:
      endpoint: ec2.us-west-2.amazonaws.com
      availability_zones: us-west-2a
      tag.system: es-at-221b-nodes
      host_type: "private_ip"
xpack.security.enabled: false
xpack.monitoring.enabled: true
xpack.ml.enabled: false
xpack.graph.enabled: false
xpack.watcher.enabled: false
bootstrap.memory_lock: false

I had the same problem on the first ES6 node I added to our ES 5.6 cluster as part of a rolling update. This requirement should really be documented. As far as I can see from the logs, the ES6 node doesn't even attempt to use the ec2 discovery plugin without the endpoint setting defined.

@dadoonet are you able to comment on this?

I think that if you don’t define the endpoint, Elasticsearch 5.6 should complain in the deprecation logs. So it should tell you that you need to change those settings before upgrading.

Did you see that @trondhindenes?


Did not check those logs, sorry. The upgrade helper did not say anything about it.
From the documentation I don't get the impression that endpoint is a required parameter. It basically says "all you need is to set zen discovery mode to ec2 and you're good". Maybe a clearer separation between required and optional parameters would make it easier. Also, Elasticsearch normally refuses to start if there's anything wrong with the settings, but in this case it didn't. To me it feels like there's a fairly serious bug in the ec2 discovery plugin for ES6.

The upgrade helper did not say anything about it.

Did you activate the deprecation logger?

From the documentation I don't get the impression that endpoint is a required parameter

You're right. If you are not using a specific key/secret, everything should be read from the instance metadata:

It does not say that endpoint is mandatory: https://www.elastic.co/guide/en/elasticsearch/plugins/current/_settings.html

endpoint: The ec2 service endpoint to connect to. This will be automatically figured out by the ec2 client based on the instance location, but can be specified explicitly. See http://docs.aws.amazon.com/general/latest/gr/rande.html#ec2_region.

Could you share the settings you were using in 5.6?

Thanks!

These are the settings we used on 5.6 without any problems:

cloud:
    aws:
        region: "{{ es_aws_region }}"
cluster.name: "{{ es_cluster_name }}"
cluster.routing.allocation.awareness.attributes: az

node.data: {{ es_data_node_enabled }}
node.name: "{{ inventory_hostname | lower }}"
path.data: "{{ es_data_path }}"
path.logs: "{{ es_logs_path }}"

network.host: _site_
network.bind_host: {{ es_internal_bind_host }}

http.port: {{ es_internal_listen_port }}
node.attr.az: {{ ec2_metadata_az.stdout }}
discovery:
    zen.hosts_provider: ec2
    ec2:
        host_type: private_ip
        groups: {{ es_disc_sg }}
        any_group: false
node.max_local_storage_nodes: 1

For ES6, we had to change to:

cluster.name: "{{ es_cluster_name }}"
cluster.routing.allocation.awareness.attributes: az

node.data: {{ es_data_node_enabled }}
node.name: "{{ inventory_hostname | lower }}"
path.data: "{{ es_data_path }}"
path.logs: "{{ es_logs_path }}"

network.host: _site_
network.bind_host: {{ es_internal_bind_host }}

http.port: {{ es_internal_listen_port }}
node.attr.az: {{ ec2_metadata_az.stdout }}
discovery:
    zen.hosts_provider: ec2
    ec2:
      endpoint: ec2.eu-west-1.amazonaws.com
      host_type: private_ip
      groups: {{ es_disc_sg }}
      any_group: false
node.max_local_storage_nodes: 1

So in short, we got rid of the cloud section (that was clearly documented, and the first ES6 nodes also refused to start with it in the config, so that's all good). I was, however, surprised that the logs didn't show anything else related to discovery. My impression was that the node just skipped zen discovery altogether until I set the endpoint attribute.
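
Both working configs in this thread use an endpoint of the form ec2.<region>.amazonaws.com. For a templated config like the one above, here is a small Python sketch of deriving that value from the old region variable. The function name is hypothetical, and the pattern is only assumed to hold for standard AWS regions; check the AWS endpoint list linked earlier for anything else.

```python
import re

# Hypothetical helper: build the discovery.ec2.endpoint value from a region
# name. Assumes the ec2.<region>.amazonaws.com pattern seen in this thread.
REGION_RE = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d+$")

def ec2_endpoint(region):
    region = region.strip().lower()
    if not REGION_RE.match(region):
        raise ValueError(f"unexpected region name: {region!r}")
    if region.startswith("cn-"):
        # China regions use a different domain suffix, so refuse to guess.
        raise ValueError(f"region {region!r} does not follow the standard pattern")
    return f"ec2.{region}.amazonaws.com"

if __name__ == "__main__":
    print(ec2_endpoint("us-west-2"))
    print(ec2_endpoint("eu-west-1"))
```

The result can then be fed into the template in place of the hard-coded endpoint value.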

Btw, we're using EC2 instance roles so no credentials are added to the config.


I think we should improve the documentation about endpoint.

As you were using region, it should have complained with 5.6 as it’s marked as deprecated:


That would have gone in the deprecation log, right? I'm guilty. We didn't pay enough attention to it. I guess I was trusting the upgrade advisor too much.
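
For anyone else about to upgrade: deprecation warnings look like the two IndexTemplateMetaData lines in the startup log at the top of this thread. A rough Python sketch, assuming that bracketed "[timestamp][LEVEL][logger] message" layout (the helper name is made up for illustration, and the exact wording of deprecation messages may vary), that picks such lines out of log text:

```python
import re

# Hypothetical helper: pull WARN-level lines that mention a deprecation out of
# Elasticsearch log text. Assumes the "[timestamp][LEVEL][logger] message"
# layout seen in the startup log earlier in this thread.
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<level>\w+)\s*\]\[(?P<logger>[^\]]+)\]\s*(?P<msg>.*)$"
)

def deprecation_warnings(lines):
    """Return (timestamp, logger, message) tuples for deprecation warnings."""
    hits = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("level") == "WARN" and "deprecated" in m.group("msg").lower():
            hits.append((m.group("ts"), m.group("logger").strip(), m.group("msg")))
    return hits

if __name__ == "__main__":
    sample = [
        "[2017-11-16T22:17:03,245][INFO ][o.e.d.DiscoveryModule    ] [UwrqR1o] using discovery type [zen]",
        "[2017-11-16T22:17:03,955][WARN ][o.e.d.c.m.IndexTemplateMetaData] Deprecated field [template] used, replaced by [index_patterns]",
    ]
    for ts, logger, msg in deprecation_warnings(sample):
        print(ts, logger, msg)
```

Running something like this against the 5.6 logs before the upgrade would surface deprecated settings that are not otherwise reported.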

@Clinton_Gormley1 Is the migration assistant supposed to detect deprecated settings including the ones defined in plugins?


@dadoonet The Upgrade Assistant only shows cluster settings, node settings, index mappings, and machine learning settings that are deprecated. It doesn't show anything specific to plugins.