Discovery-ec2 plugin always tries to ping localhost / never finds the nodes that it should

Hi there.

I am new to Elasticsearch, but not new to systems administration. Apologies if I missed something glaringly obvious.

I am trying to build an ES cluster in AWS. I am hoping to use the discovery-ec2 plugin to automate node discovery.

However, my cluster will not spin up. Every node of the cluster gets "stuck" waiting for more masters.

Topology:

3 Masters
2 Data nodes
1 Client

Here's a snip of the logs.

I have turned logger.org.elasticsearch.discovery to trace level on this node.

This particular node has a hostname of ip-172-25-34-206.

This node (like all others in the cluster) runs Ubuntu 18.04.

This looks ok...

[o.e.d.e.AwsEc2UnicastHostsProvider] [ip-172-25-34-206] using host_type [private_ip], tags [{Cluster=[prototype.testing-domain.com-ElasticSearchCluster]}], groups [[]] with any_group [true], availability_zones [[us-west-1b, us-west-1c]]
[2018-12-11T00:57:38,554][INFO ][o.e.d.DiscoveryModule    ] [ip-172-25-34-206] using discovery type [zen] and host providers [settings, ec2]

Whatever software powers this discussion board, it's garbage. Here's part 2 of 3 of my post:

Here's a snip of logs from one of the masters showing the actual problem:

    root@ip-172-25-34-206:/var/log/elasticsearch# service elasticsearch start && tail -f ElasticSearchCluster.log
    [2018-12-11T00:57:25,088][INFO ][o.e.e.NodeEnvironment    ] [ip-172-25-34-206] using [1] data paths, mounts [[/ (/dev/nvme0n1p1)]], net usable_space [3.5gb], net total_space [7.6gb], types [ext4]
    [2018-12-11T00:57:25,098][INFO ][o.e.e.NodeEnvironment    ] [ip-172-25-34-206] heap size [1007.3mb], compressed ordinary object pointers [true]
    [2018-12-11T00:57:25,099][INFO ][o.e.n.Node               ] [ip-172-25-34-206] node name [ip-172-25-34-206], node ID [rdjd-hXDRjieEPEeeeHyIA]
    [2018-12-11T00:57:25,099][INFO ][o.e.n.Node               ] [ip-172-25-34-206] version[6.5.2], pid[5138], build[default/deb/9434bed/2018-11-29T23:58:20.891072Z], OS[Linux/4.15.0-1029-aws/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_191/25.191-b12]
    [2018-12-11T00:57:25,100][INFO ][o.e.n.Node               ] [ip-172-25-34-206] JVM arguments [-Xms1g, -Xmx1g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch.LaJjFVV8, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:/var/log/elasticsearch/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=default, -Des.distribution.type=deb]
    [2018-12-11T00:57:28,856][DEBUG][o.e.d.e.Ec2ClientSettings] [ip-172-25-34-206] Using either environment variables, system properties or instance profile credentials
    [2018-12-11T00:57:29,318][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [aggs-matrix-stats]
    [2018-12-11T00:57:29,318][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [analysis-common]
    [2018-12-11T00:57:29,318][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [ingest-common]
    [2018-12-11T00:57:29,318][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [lang-expression]
    [2018-12-11T00:57:29,318][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [lang-mustache]
    [2018-12-11T00:57:29,318][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [lang-painless]
    [2018-12-11T00:57:29,319][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [mapper-extras]
    [2018-12-11T00:57:29,319][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [parent-join]
    [2018-12-11T00:57:29,319][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [percolator]
    [2018-12-11T00:57:29,319][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [rank-eval]
    [2018-12-11T00:57:29,319][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [reindex]
    [2018-12-11T00:57:29,319][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [repository-url]
    [2018-12-11T00:57:29,319][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [transport-netty4]
    [2018-12-11T00:57:29,319][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [tribe]
    [2018-12-11T00:57:29,319][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-ccr]
    [2018-12-11T00:57:29,320][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-core]
    [2018-12-11T00:57:29,320][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-deprecation]
    [2018-12-11T00:57:29,320][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-graph]
    [2018-12-11T00:57:29,320][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-logstash]
    [2018-12-11T00:57:29,320][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-ml]
    [2018-12-11T00:57:29,320][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-monitoring]
    [2018-12-11T00:57:29,320][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-rollup]
    [2018-12-11T00:57:29,320][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-security]
    [2018-12-11T00:57:29,321][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-sql]
    [2018-12-11T00:57:29,321][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-upgrade]
    [2018-12-11T00:57:29,321][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded module [x-pack-watcher]
    [2018-12-11T00:57:29,328][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded plugin [discovery-ec2]
    [2018-12-11T00:57:29,328][INFO ][o.e.p.PluginsService     ] [ip-172-25-34-206] loaded plugin [repository-s3]

(snip, continued below...)

cont ...

    [2018-12-11T00:57:29,380][DEBUG][o.e.d.e.Ec2DiscoveryPlugin] [ip-172-25-34-206] obtaining ec2 [placement/availability-zone] from ec2 meta-data url http://169.254.169.254/latest/meta-data/placement/availability-zone
    [2018-12-11T00:57:34,716][DEBUG][o.e.d.e.Ec2DiscoveryPlugin] [ip-172-25-34-206] Register _ec2_, _ec2:xxx_ network names
    [2018-12-11T00:57:37,133][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [ip-172-25-34-206] [controller/5206] [Main.cc@109] controller (64 bit): Version 6.5.2 (Build 767566e25172d6) Copyright (c) 2018 Elasticsearch BV
    [2018-12-11T00:57:38,550][DEBUG][o.e.d.z.SettingsBasedHostsProvider] [ip-172-25-34-206] using initial hosts [127.0.0.1, [::1]]
    [2018-12-11T00:57:38,552][DEBUG][o.e.d.e.AwsEc2UnicastHostsProvider] [ip-172-25-34-206] using host_type [private_ip], tags [{Cluster=[prototype.testing-domain.com-ElasticSearchCluster]}], groups [[]] with any_group [true], availability_zones [[us-west-1b, us-west-1c]]
    [2018-12-11T00:57:38,554][INFO ][o.e.d.DiscoveryModule    ] [ip-172-25-34-206] using discovery type [zen] and host providers [settings, ec2]
    [2018-12-11T00:57:38,556][DEBUG][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] using concurrent_connects [10], resolve_timeout [5s]
    [2018-12-11T00:57:38,558][DEBUG][o.e.d.z.ElectMasterService] [ip-172-25-34-206] using minimum_master_nodes [2]
    [2018-12-11T00:57:38,558][DEBUG][o.e.d.z.ZenDiscovery     ] [ip-172-25-34-206] using ping_timeout [3s], join.timeout [1m], master_election.ignore_non_master [false]
    [2018-12-11T00:57:38,560][DEBUG][o.e.d.z.MasterFaultDetection] [ip-172-25-34-206] [master] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
    [2018-12-11T00:57:38,571][DEBUG][o.e.d.z.NodesFaultDetection] [ip-172-25-34-206] [node  ] uses ping_interval [1s], ping_timeout [30s], ping_retries [3]
    [2018-12-11T00:57:40,109][INFO ][o.e.n.Node               ] [ip-172-25-34-206] initialized
    [2018-12-11T00:57:40,109][INFO ][o.e.n.Node               ] [ip-172-25-34-206] starting ...
    [2018-12-11T00:57:40,227][DEBUG][o.e.d.e.Ec2NameResolver  ] [ip-172-25-34-206] obtaining ec2 hostname from ec2 meta-data url http://169.254.169.254/latest/meta-data/local-ipv4
    [2018-12-11T00:57:40,331][DEBUG][o.e.d.e.Ec2NameResolver  ] [ip-172-25-34-206] obtaining ec2 hostname from ec2 meta-data url http://169.254.169.254/latest/meta-data/local-ipv4
    [2018-12-11T00:57:40,333][INFO ][o.e.t.TransportService   ] [ip-172-25-34-206] publish_address {172.25.34.206:9300}, bound_addresses {172.25.34.206:9300}
    [2018-12-11T00:57:40,345][INFO ][o.e.b.BootstrapChecks    ] [ip-172-25-34-206] bound or publishing to a non-loopback address, enforcing bootstrap checks
    [2018-12-11T00:57:40,397][TRACE][o.e.d.z.NodeJoinController] [ip-172-25-34-206] starting an election context, will accumulate joins
    [2018-12-11T00:57:40,401][TRACE][o.e.d.z.ZenDiscovery     ] [ip-172-25-34-206] starting to ping
    [2018-12-11T00:57:40,420][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] resolved host [127.0.0.1] to [127.0.0.1:9300, 127.0.0.1:9301, 127.0.0.1:9302, 127.0.0.1:9303, 127.0.0.1:9304]
    [2018-12-11T00:57:40,420][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] resolved host [[::1]] to [[::1]:9300, [::1]:9301, [::1]:9302, [::1]:9303, [::1]:9304]
    [2018-12-11T00:57:40,420][DEBUG][o.e.d.e.AwsEc2ServiceImpl] [ip-172-25-34-206] Using either environment variables, system properties or instance profile credentials
    [2018-12-11T00:57:41,477][TRACE][o.e.d.e.AwsEc2UnicastHostsProvider] [ip-172-25-34-206] building dynamic unicast discovery nodes...
    [2018-12-11T00:57:41,478][DEBUG][o.e.d.e.AwsEc2UnicastHostsProvider] [ip-172-25-34-206] using dynamic transport addresses []
    [2018-12-11T00:57:41,484][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] opening connection to [{[::1]:9300}{weqPeT4oR2ml4eswAuUkpg}{0:0:0:0:0:0:0:1}{[::1]:9300}]
    [2018-12-11T00:57:41,487][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] opening connection to [{127.0.0.1:9302}{DcNCABnUQTW-8BxzcImgcg}{127.0.0.1}{127.0.0.1:9302}]
    [2018-12-11T00:57:41,513][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] opening connection to [{127.0.0.1:9304}{0GzDtp_2RbmgKi5FiYOmKg}{127.0.0.1}{127.0.0.1:9304}]
    [2018-12-11T00:57:41,525][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] opening connection to [{127.0.0.1:9300}{aSOqBvfdR5iPHKewmGaEDQ}{127.0.0.1}{127.0.0.1:9300}]
    [2018-12-11T00:57:41,530][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] opening connection to [{[::1]:9303}{w0ACsLgETuKNRtPDgArUTQ}{0:0:0:0:0:0:0:1}{[::1]:9303}]
    [2018-12-11T00:57:41,537][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] opening connection to [{127.0.0.1:9303}{RxhdUAFdTMSZAtdyJQuOkA}{127.0.0.1}{127.0.0.1:9303}]
    [2018-12-11T00:57:41,546][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] opening connection to [{127.0.0.1:9301}{csBsiPQoT6-qCcdP2xdjJQ}{127.0.0.1}{127.0.0.1:9301}]
    [2018-12-11T00:57:41,548][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] opening connection to [{[::1]:9302}{hfKuzKZTRpWySKvKYOmXjg}{0:0:0:0:0:0:0:1}{[::1]:9302}]
    [2018-12-11T00:57:41,551][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] opening connection to [{[::1]:9304}{TFb7Y4qCSo6tMlKSQQrbZg}{0:0:0:0:0:0:0:1}{[::1]:9304}]
    [2018-12-11T00:57:41,551][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] opening connection to [{[::1]:9301}{JZL4Yd9VSn6tcObNPCE3cg}{0:0:0:0:0:0:0:1}{[::1]:9301}]
    [2018-12-11T00:57:41,579][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] failed to ping {127.0.0.1:9301}{csBsiPQoT6-qCcdP2xdjJQ}{127.0.0.1}{127.0.0.1:9301}
    org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:9301] connect_exception
    	at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:165) ~[elasticsearch-6.5.2.jar:6.5.2]
    	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:454) ~[elasticsearch-6.5.2.jar:6.5.2]
    	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:117) ~[elasticsearch-6.5.2.jar:6.5.2]
    	at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:237) ~[elasticsearch-6.5.2.jar:6.5.2]
    	at org.elasticsearch.transport.ConnectionManager.openConnection(ConnectionManager.java:95) ~[elasticsearch-6.5.2.jar:6.5.2]
    	at org.elasticsearch.transport.TransportService.openConnection(TransportService.java:393) ~[elasticsearch-6.5.2.jar:6.5.2]
    	at org.elasticsearch.discovery.zen.UnicastZenPing$PingingRound.getOrConnect(UnicastZenPing.java:364) ~[elasticsearch-6.5.2.jar:6.5.2]
    	at org.elasticsearch.discovery.zen.UnicastZenPing$3.doRun(UnicastZenPing.java:471) [elasticsearch-6.5.2.jar:6.5.2]
    	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:723) [elasticsearch-6.5.2.jar:6.5.2]
    

cont...

    	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.5.2.jar:6.5.2]
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
    Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /127.0.0.1:9301


    [2018-12-11T00:57:42,503][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] opening connection to [{[::1]:9300}{2gfRXEeFQaui3c7kJ4lZkQ}{0:0:0:0:0:0:0:1}{[::1]:9300}]
    [2018-12-11T00:57:42,506][TRACE][o.e.d.z.UnicastZenPing   ] [ip-172-25-34-206] [1] failed to ping {127.0.0.1:9304}{fEHx1WWfSTqFdkOt7_Zrxw}{127.0.0.1}{127.0.0.1:9304}
    org.elasticsearch.transport.ConnectTransportException: [][127.0.0.1:9304] 

So what am I missing with the EC2 discovery plugin that is causing the cluster to find "localhost" and repeatedly ping it?

It looks like this exact issue was actually broached here, but that thread has no solution.

Here's the (slightly edited) config for this particular master node:

bootstrap.memory_lock: true
node.name: ${HOSTNAME}

action.destructive_requires_name: true
# default is unbounded
indices.fielddata.cache.size: 1% 

cluster.name: BlahCluster
discovery.zen.minimum_master_nodes: 2

# only data nodes should have ingest and http capabilities
node.master: true
node.data: false
node.ingest: false
http.enabled: false
xpack.security.enabled: true
xpack.monitoring.enabled: true
path.data: /mnt/blah
path.logs: /mnt/blah

network.host: _ec2:privateIpv4_
plugin.mandatory: discovery-ec2

cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone

discovery:
    zen.hosts_provider: ec2
    ec2.groups: sg-BLAH
    ec2.host_type: private_ip
    ec2.tag.Cluster: SomeClusterTag
    ec2.availability_zones: us-west-1b,us-west-1c
    ec2.protocol: https

I have confirmed that SomeClusterTag is on every node I would expect to see in the cluster.
I have confirmed that the security group sg-BLAH is on every node and allows ports 9200 through 9400.
I know that there is no problem with inter-node communications because when I add the hosts manually, the cluster has no problem assembling:

discovery.zen.ping.unicast.hosts: [
"172.25.56.186",
"172.25.48.249",
"172.25.38.47",
"172.25.51.137",
"172.25.34.128"]

So what am I missing?

The pinging of localhost is a diversion, I think. That's the default behaviour if discovery.zen.ping.unicast.hosts is unset. You can suppress it by setting it to the empty list instead:

discovery.zen.ping.unicast.hosts: []

The real issue lies between these two lines:

[2018-12-11T00:57:41,477][TRACE][o.e.d.e.AwsEc2UnicastHostsProvider] [ip-172-25-34-206] building dynamic unicast discovery nodes...
[2018-12-11T00:57:41,478][DEBUG][o.e.d.e.AwsEc2UnicastHostsProvider] [ip-172-25-34-206] using dynamic transport addresses []

The first line is reported here:

After that, Elasticsearch loops through all the instances returned by the DescribeInstances call and logs various things at various points, before finally logging "using dynamic transport addresses ...". The fact that nothing is logged between these two lines suggests that EC2 returned no instances.

Can you construct an analogous DescribeInstances call and verify that the right instances are genuinely returned? The relevant code is here:

@DavidTurner, thank you for the very quick reply. It is appreciated.

It turns out, you're right about the red herring. Thank you for that!

Also, thank you for linking me to the ES code. There's a lot of syntax that's not familiar to me (Java 6 was recently released the last time I did any Java dev...), but I think I have been able to parse out what the equivalent query is:

root@ip-172-25-60-197:/etc/elasticsearch# aws ec2 describe-instances --filters "Name=instance-state-name,Values=running,pending"  "Name=tag:Cluster,Values=prototype.dev-domain.com-ElasticSearchCluster" --region us-west-1 --query "Reservations[*].Instances[*].[InstanceId,"SecurityGroups"[*]]" | jq
[
  [
    [
      "i-12345678cd990ed6b",
      [
        {
          "GroupName": "elasticsearch-ElasticSearchCluster-security-group",
          "GroupId": "sg-1234567816fa7017b"
        }
      ]
    ]
  ],
  [
    [
      "i-12345678009741f88",
      [
        {
          "GroupName": "elasticsearch-ElasticSearchCluster-security-group",
          "GroupId": "sg-1234567816fa7017b"
        }
      ]
    ]
  ],
  [
    [
      "i-12345678422b5884e",
      [
        {
          "GroupName": "elasticsearch-ElasticSearchCluster-security-group",
          "GroupId": "sg-1234567816fa7017b"
        },
        {
          "GroupName": "elasticsearch-ElasticSearchCluster-clients-security-group",
          "GroupId": "sg-12345678f5ada8419"
        }
      ]
    ]
  ],
  [
    [
      "i-1234567897d5da83b",
      [
        {
          "GroupName": "elasticsearch-ElasticSearchCluster-security-group",
          "GroupId": "sg-1234567816fa7017b"
        }
      ]
    ]
  ],
  [
    [
      "i-123456785d2225990",
      [
        {
          "GroupName": "elasticsearch-ElasticSearchCluster-security-group",
          "GroupId": "sg-1234567816fa7017b"
        }
      ]
    ],
    [
      "i-12345678cb85a5055",
      [
        {
          "GroupName": "elasticsearch-ElasticSearchCluster-security-group",
          "GroupId": "sg-1234567816fa7017b"
        }
      ]
    ]
  ]
]

There are 6 unique instance IDs in the results. My desired cluster has 6 nodes.
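For anyone repeating this check, counting the distinct IDs by eye gets error-prone with larger clusters. Here is a minimal shell sketch, assuming the IDs appear in the output as quoted strings like "i-12345678cd990ed6b". The helper name count_instance_ids is made up for illustration and is not part of the aws CLI.

```shell
# Hypothetical helper: count distinct EC2 instance IDs in text read from stdin.
# Assumption: IDs appear as quoted strings such as "i-12345678cd990ed6b";
# security group IDs ("sg-...") do not match the pattern and are ignored.
count_instance_ids() {
  grep -o '"i-[0-9a-f]*"' | sort -u | wc -l | tr -d ' '
}

# Offline example: three ID lines, two of them identical, so this prints 2.
printf '%s\n' '"i-12345678cd990ed6b",' '"i-12345678009741f88",' '"i-12345678cd990ed6b",' \
  | count_instance_ids
```

On a real node, the output of the describe-instances command above could be piped into count_instance_ids in place of jq.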

Here's the configuration on the same node that ran the aws ec2 describe-instances call:

root@ip-172-25-60-197:/etc/elasticsearch# cat elasticsearch.yml
bootstrap.memory_lock: true
node.name: ${HOSTNAME}

action.destructive_requires_name: true
# default is unbounded
indices.fielddata.cache.size: 1% 
cluster.name: ElasticSearchCluster

discovery.zen.minimum_master_nodes: 2

# only data nodes should have ingest and http capabilities
node.master: false
node.data: false
node.ingest: false
http.enabled: true
xpack.security.enabled: false
xpack.monitoring.enabled: true
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch

network.host: _ec2:privateIpv4_,localhost
plugin.mandatory: discovery-ec2

# node can be made aware of EC2 info
# See: https://www.elastic.co/guide/en/elasticsearch/plugins/6.5/_settings.html#discovery-ec2-attributes
cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone

discovery.zen.ping.unicast.hosts: []

logger.org.elasticsearch.discovery: trace

discovery:
    zen.hosts_provider: ec2
    # ec2.groups: sg-1234567816fa7017b
    ec2.any_group: true
    ec2.host_type: private_ip
    ec2.tag.Cluster: prototype.dev-domain.com-ElasticSearchCluster
    ec2.availability_zones: us-west-1b,us-west-1c
    # Why would you use anything but?
    ec2.protocol: https

Notice that sg-1234567816fa7017b, the security group referenced by the ec2.groups setting (commented out above), is attached to all 6 nodes returned by my query.

Something did just occur to me: in my query:

aws ec2 describe-instances --filters "Name=instance-state-name,Values=running,pending"  "Name=tag:Cluster,Values=prototype.dev-domain.com-ElasticSearchCluster" --region us-west-1 --query "Reservations[*].Instances[*].[InstanceId,"SecurityGroups"[*]]" | jq

I do not pass in any credentials. This is because I am running the aws binary on a computer that lives in AWS and has been blessed with an instance profile:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ec2Read",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeTags",
                "ec2:DescribeInstances"
            ],
            "Resource": "*"
        }
    ]
}

Is there something I need to do to get the discovery plugin to use the instance profile?

So I have some interesting developments. It turns out that the IAM instance profile is working fine, as I do get a proper X-Amz-Security-Token.

I set logger.org.apache.http.wire: trace in my config, hoping to see what the AWS API was returning.

Here's the gist after a bit of cleanup:

http-outgoing-0 >> "POST / HTTP/1.1[\r][\n]"
http-outgoing-0 >> "Host: ec2.us-east-1.amazonaws.com[\r][\n]"
http-outgoing-0 >> "Authorization: AWS4-HMAC-SHA256 Credential=LOL_NOPE/20181211/us-east-1/ec2/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-retry;host;user-agent;x-amz-date;x-amz-security-token, Signature=LOL_NOPE[\r][\n]"
http-outgoing-0 >> "X-Amz-Date: 20181211T213956Z[\r][\n]"
http-outgoing-0 >> "User-Agent: aws-sdk-java/1.11.187 Linux/4.15.0-1029-aws Java_HotSpot(TM)_64-Bit_Server_VM/25.191-b12/1.8.0_191[\r][\n]"
http-outgoing-0 >> "X-Amz-Security-Token: LOL_NOPE[\r][\n]"
http-outgoing-0 >> "amz-sdk-invocation-id: 832SOME_UUID8[\r][\n]"
http-outgoing-0 >> "amz-sdk-retry: 0/0/500[\r][\n]"
http-outgoing-0 >> "Content-Type: application/x-www-form-urlencoded; charset=utf-8[\r][\n]"
http-outgoing-0 >> "Content-Length: 304[\r][\n]"
http-outgoing-0 >> "Connection: Keep-Alive[\r][\n]"
http-outgoing-0 >> "[\r][\n]"
http-outgoing-0 >> "Action=DescribeInstances&Version=2016-11-15&Filter.1.Name=instance-state-name&Filter.1.Value.1=running&Filter.1.Value.2=pending&Filter.2.Name=tag%3ACluster&Filter.2.Value.1=prototype.dev-domain.com-ElasticSearchCluster&Filter.3.Name=availability-zone&Filter.3.Value.1=us-west-1b&Filter.3.Value.2=us-west-1c"
http-outgoing-0 << "HTTP/1.1 200 OK[\r][\n]"
http-outgoing-0 << "Content-Type: text/xml;charset=UTF-8[\r][\n]"
http-outgoing-0 << "Content-Length: 230[\r][\n]"
http-outgoing-0 << "Date: Tue, 11 Dec 2018 21:39:56 GMT[\r][\n]"
http-outgoing-0 << "Server: AmazonEC2[\r][\n]"
http-outgoing-0 << "[\r][\n]"
http-outgoing-0 << "<?xml version="1.0" encoding="UTF-8"?>[\n]"
http-outgoing-0 << "<DescribeInstancesResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/">[\n]"
http-outgoing-0 << "    <requestId>SOME_UUID4</requestId>[\n]"
http-outgoing-0 << "    <reservationSet/>[\n]"
http-outgoing-0 << "</DescribeInstancesResponse>"
[2018-12-11T21:40:00,299][WARN ][o.e.d.z.ZenDiscovery     ] [ip-172-25-60-197] not enough master nodes discovered during pinging (found [[]], but needed [2]), pinging again

That is an empty DescribeInstancesResponse set. That does explain why this cluster won't come up.
The search query that the plugin sends is (functionally) identical to the one that I send.

I can replicate this result:

aws ec2 describe-instances --filters "Name=instance-state-name,Values=running,pending"  "Name=tag:Cluster,Values=prototype.dev-domain.com-ElasticSearchCluster" --region us-east-1 --query "Reservations[*].Instances[*].[InstanceId,"SecurityGroups"[*]]"
[]

So here's the kicker. I noticed the Host header....

Host: ec2.us-east-1.amazonaws.com

See that us-east-1? That's wrong. None of the subnets I have live there. None of the security groups that I am using live there. The manual aws ec2 call I issued earlier used --region us-west-1, not us-east-1...

But now I am really confused, as I see this in the logs:
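If you want to check which region the plugin is really calling without reading the whole trace, here is a minimal shell sketch. It assumes logger.org.apache.http.wire is set to trace as above, and that the Host header has the form ec2.REGION.amazonaws.com. The helper name ec2_regions_in_wire_log is made up for illustration.

```shell
# Hypothetical helper: read wire-trace lines on stdin and print the region of
# every EC2 endpoint that appears in a Host header, de-duplicated.
# Assumption: the header looks like "Host: ec2.<region>.amazonaws.com".
ec2_regions_in_wire_log() {
  sed -n 's/.*Host: ec2\.\([a-z0-9-]*\)\.amazonaws\.com.*/\1/p' | sort -u
}

# Offline example using the Host line from the trace above.
printf '%s\n' 'http-outgoing-0 >> "Host: ec2.us-east-1.amazonaws.com[\r][\n]"' \
  | ec2_regions_in_wire_log
```

On a node, the Elasticsearch log file would be fed to the function on stdin. Its location depends on path.logs and cluster.name in your own setup.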

[2018-12-11T00:57:29,380][DEBUG][o.e.d.e.Ec2DiscoveryPlugin] [ip-172-25-34-206] obtaining ec2 [placement/availability-zone] from ec2 meta-data url http://169.254.169.254/latest/meta-data/placement/availability-zone

Let's go take a look at what this returns, shall we?

root@ip-172-25-60-197:/etc/elasticsearch# curl -vvv http://169.254.169.254/latest/meta-data/placement/availability-zone
*   Trying 169.254.169.254...
* TCP_NODELAY set
* Connected to 169.254.169.254 (169.254.169.254) port 80 (#0)
> GET /latest/meta-data/placement/availability-zone HTTP/1.1
> Host: 169.254.169.254
> User-Agent: curl/7.58.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Accept-Ranges: none
< Last-Modified: Tue, 11 Dec 2018 20:50:39 GMT
< Content-Length: 10
< Date: Tue, 11 Dec 2018 21:46:54 GMT
< Server: EC2ws
< Connection: close
<
* Closing connection 0
us-west-1c

I get us-west-1c.

So what the hell, Elasticsearch? You clearly know you're not in us-east-1; curl http://169.254.169.254/latest/meta-data/placement/availability-zone makes that super clear.

@DavidTurner, have I stumbled into a bug / regression?

From: https://www.elastic.co/guide/en/elasticsearch/plugins/6.5/_settings.html#discovery-ec2-attributes

endpoint - This will be *automatically figured out by the ec2 client based on the instance location*, but can be specified explicitly. 

This is not accurate: as the wire trace above shows, the client called the us-east-1 endpoint even though the instance is in us-west-1c.

And another good bit of news! As soon as I manually specified discovery.ec2.endpoint: ec2.us-west-1.amazonaws.com in my config, the cluster assembled!

So for anybody else that finds this thread, the current 6.5 documentation lies. If you are operating a cluster in any region other than us-east-1, you will need to specify the proper discovery.ec2.endpoint for your region, despite the "automatic" claims in the documentation. This documentation has been inaccurate for more than 13 months at this point! https://github.com/elastic/elasticsearch/issues/27464

[2018-12-11T22:01:57,096][INFO ][o.e.c.s.ClusterApplierService] [ip-172-25-60-197] added {{ip-172-25-48-186}{xnXOhyGqScihxgV8MLgoHg}{cjMGm5fESn-aO9okh97yXQ}{172.25.48.186}{172.25.48.186:9300}{aws_availability_zone=us-west-1c, ml.machine_memory=3885350912, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {ip-172-25-46-202}{U7MGS4kZT5OOQBqXBwiNbg}{zVRMl53eQuaNzDdJ6pm7RA}{172.25.46.202}{172.25.46.202:9300}{aws_availability_zone=us-west-1b, ml.machine_memory=3885350912, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [15]])
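The fix above amounts to pointing the plugin at the endpoint for the region the node actually lives in. Here is a minimal shell sketch for deriving that setting. It assumes a standard region whose availability zone name is the region name plus one trailing letter (us-west-1c is in us-west-1) and the amazonaws.com domain. The helper name ec2_endpoint_for_az is made up for illustration and is not part of Elasticsearch or the plugin.

```shell
# Hypothetical helper: derive the regional EC2 endpoint from an availability
# zone name by stripping a single trailing lowercase letter.
# Assumption: zone = region + one letter, and the domain is amazonaws.com.
ec2_endpoint_for_az() {
  printf 'ec2.%s.amazonaws.com\n' "${1%[a-z]}"
}

# Offline example using the zone returned by the metadata service above;
# prints the line to add to elasticsearch.yml.
printf 'discovery.ec2.endpoint: %s\n' "$(ec2_endpoint_for_az us-west-1c)"
```

On an instance, the zone could be read from the metadata URL shown earlier, for example ec2_endpoint_for_az "$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)". Zones that do not follow the region-plus-letter pattern, or partitions with a different domain suffix, would need different handling.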

I'm sorry you hit this issue. There's a pending PR about this documentation bug.

@dadoonet thanks for the follow up.

As I was opening a GH issue to reflect this thread, I noticed that there was an existing one:

The docs have been inaccurate for 13 months now. Inaccurate docs are a fact of life; what bothers me is that the problem has been known for 13 months and not even a "we're sorry, please ignore the above and see this issue for more..." note could get posted. Here's to hoping that other folks will stumble on this thread via Google until the PR is merged or the eventual heat death of the universe, whichever comes first.
