So i have some interesting developments. It turns out that IAM profile is working fine, as i do get a proper X-Amz-Security-token.
I set logger.org.apache.http.wire: trace in my config, hoping to see what the AWS API was returning.
Here's the gist after a bit of cleanup
http-outgoing-0 >> "POST / HTTP/1.1[\r][\n]"
http-outgoing-0 >> "Host: ec2.us-east-1.amazonaws.com[\r][\n]"
http-outgoing-0 >> "Authorization: AWS4-HMAC-SHA256 Credential=LOL_NOPE/20181211/us-east-1/ec2/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-retry;host;user-agent;x-amz-date;x-amz-security-token, Signature=LOL_NOPE[\r][\n]"
http-outgoing-0 >> "X-Amz-Date: 20181211T213956Z[\r][\n]"
http-outgoing-0 >> "User-Agent: aws-sdk-java/1.11.187 Linux/4.15.0-1029-aws Java_HotSpot(TM)_64-Bit_Server_VM/25.191-b12/1.8.0_191[\r][\n]"
http-outgoing-0 >> "X-Amz-Security-Token: LOL_NOPE[\r][\n]"
http-outgoing-0 >> "amz-sdk-invocation-id: 832SOME_UUID8[\r][\n]"
http-outgoing-0 >> "amz-sdk-retry: 0/0/500[\r][\n]"
http-outgoing-0 >> "Content-Type: application/x-www-form-urlencoded; charset=utf-8[\r][\n]"
http-outgoing-0 >> "Content-Length: 304[\r][\n]"
http-outgoing-0 >> "Connection: Keep-Alive[\r][\n]"
http-outgoing-0 >> "[\r][\n]"
http-outgoing-0 >> "Action=DescribeInstances&Version=2016-11-15&Filter.1.Name=instance-state-name&Filter.1.Value.1=running&Filter.1.Value.2=pending&Filter.2.Name=tag%3ACluster&Filter.2.Value.1=prototype.dev-domain.com-ElasticSearchCluster&Filter.3.Name=availability-zone&Filter.3.Value.1=us-west-1b&Filter.3.Value.2=us-west-1c"
http-outgoing-0 << "HTTP/1.1 200 OK[\r][\n]"
http-outgoing-0 << "Content-Type: text/xml;charset=UTF-8[\r][\n]"
http-outgoing-0 << "Content-Length: 230[\r][\n]"
http-outgoing-0 << "Date: Tue, 11 Dec 2018 21:39:56 GMT[\r][\n]"
http-outgoing-0 << "Server: AmazonEC2[\r][\n]"
http-outgoing-0 << "[\r][\n]"
http-outgoing-0 << "<?xml version="1.0" encoding="UTF-8"?>[\n]"
http-outgoing-0 << "<DescribeInstancesResponse xmlns="http://ec2.amazonaws.com/doc/2016-11-15/">[\n]"
http-outgoing-0 << " <requestId>SOME_UUID4</requestId>[\n]"
http-outgoing-0 << " <reservationSet/>[\n]"
http-outgoing-0 << "</DescribeInstancesResponse>"
[2018-12-11T21:40:00,299][WARN ][o.e.d.z.ZenDiscovery ] [ip-172-25-60-197] not enough master nodes discovered during pinging (found [[]], but needed [2]), pinging again
That is an empty DescribeInstancesResponse set. That does explain why this cluster wont come up.
The search query that the plugin sends is (functionally) identical to the one that i send.
I can replicate this result:
aws ec2 describe-instances --filters "Name=instance-state-name,Values=running,pending" "Name=tag:Cluster,Values=prototype.dev-domain.com-ElasticSearchCluster" --region us-east-1 --query "Reservations[*].Instances[*].[InstanceId,"SecurityGroups"[*]]"
[]
So here's the kicker. I noticed the Host header....
Host: ec2.us-east-1.amazonaws.com[\r][\n]"
See that us-east-1? That's wrong. None of the subnets i have live there. None of the security groups that I am using live there. The manual aws ec2 call i issued didn't have a --region us-east-1...
But now i am really confused, as i see this in the logs:
[2018-12-11T00:57:29,380][DEBUG][o.e.d.e.Ec2DiscoveryPlugin] [ip-172-25-34-206] obtaining ec2 [placement/availability-zone] from ec2 meta-data url http://169.254.169.254/latest/meta-data/placement/availability-zone
Let's go take a look at what this returns, shall we?
root@ip-172-25-60-197:/etc/elasticsearch# curl -vvv http://169.254.169.254/latest/meta-data/placement/availability-zone
* Trying 169.254.169.254...
* TCP_NODELAY set
* Connected to 169.254.169.254 (169.254.169.254) port 80 (#0)
> GET /latest/meta-data/placement/availability-zone HTTP/1.1
> Host: 169.254.169.254
> User-Agent: curl/7.58.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: text/plain
< Accept-Ranges: none
< Last-Modified: Tue, 11 Dec 2018 20:50:39 GMT
< Content-Length: 10
< Date: Tue, 11 Dec 2018 21:46:54 GMT
< Server: EC2ws
< Connection: close
<
* Closing connection 0
us-west-1c
I get us-west-1c.
So what the hell, elastic search? you clearly know you're not in us-east-1, curl http://169.254.169.254/latest/meta-data/placement/availability-zone makes that super clear.
@DavidTurner, have I stumbled into a bug / regression?
From: https://www.elastic.co/guide/en/elasticsearch/plugins/6.5/_settings.html#discovery-ec2-attributes
endpoint - This will be *automatically figured out by the ec2 client based on the instance location*, but can be specified explicitly.
This is not accurate, as the endpoint is using the wrong region above ^.