Cloud-aws plugin not able to join cluster

I have been attempting to use the Elasticsearch "cloud-aws" plugin to join Elasticsearch nodes to my single master. I have been through a few online guides and tried settings from various sources, but I still can't get the new nodes to join the existing master.

I have configured IAM roles and tags for EC2, and this is my elasticsearch.yml file on one node (the others are similar):

node.name: Thor
node.client: "true"
network.host: localhost
cloud:aws:access_key: foobar
cloud:aws:secret_key: barfoo
cloud:aws:region: eu-west-1
discovery:type: ec2
discovery.ec2.tag.elasticsearch: Ubuntu-ElasticNode

The logging from elasticsearch is not very helpful and even in DEBUG mode not much is offered up.

[2016-03-15 23:01:05,440][INFO ][node                     ] [Thor] version[2.2.0], pid[1550], build[8ff36d1/2016-01-27T13:32:39Z]
[2016-03-15 23:01:05,447][INFO ][node                     ] [Thor] initializing ...
[2016-03-15 23:01:06,685][INFO ][plugins                  ] [Thor] modules     [lang-expression, lang-groovy], plugins [cloud-aws], sites []
[2016-03-15 23:01:10,016][INFO ][node                     ] [Thor] initialized
[2016-03-15 23:01:10,017][INFO ][node                     ] [Thor] starting ...
[2016-03-15 23:01:10,106][INFO ][transport                ] [Thor] publish_address {localhost/127.0.0.1:9300}, bound_addresses {127.0.0.1:9300}
[2016-03-15 23:01:10,115][INFO ][discovery                ] [Thor]   elasticsearch/9PmYq5tXQcaPUPqDh4VTSQ
[2016-03-15 23:01:40,116][WARN ][discovery                ] [Thor] waited for 30s and no initial state was set by the discovery
[2016-03-15 23:01:40,155][INFO ][http                     ] [Thor] publish_address {localhost/127.0.0.1:9200}, bound_addresses {127.0.0.1:9200}
[2016-03-15 23:01:40,155][INFO ][node                     ] [Thor] started
[2016-03-15 23:54:40,863][DEBUG][action.admin.cluster.health] [Thor] no known master node, scheduling a retry
[2016-03-15 23:55:10,864][DEBUG][action.admin.cluster.health] [Thor] timed out while retrying    [cluster:monitor/health] after failure (timeout [30s])
[2016-03-15 23:55:10,874][INFO ][rest.suppressed          ] /_cluster/health  Params: {pretty=}
MasterNotDiscoveredException[null]
at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$5.onTimeout(TransportMasterNodeAction.java:205)
at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:239)
at org.elasticsearch.cluster.service.InternalClusterService$NotifyTimeout.run(InternalClusterService.java:794)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

I have the port range 9200 - 9400 open between the elasticsearch servers but the log seems to indicate that the discovery is still timing out. I set "discovery.ec2.tag.*" to speed up the discovery process but this hasn't helped.

Does anyone have any idea how this plugin needs to be configured? I have read a few guides that use even fewer configuration options than I have and are still able to join nodes to the master.

I believe you need to change network.host. Your nodes won't be able to communicate with other instances otherwise.

Read this: https://www.elastic.co/guide/en/elasticsearch/plugins/current/cloud-aws-discovery.html#cloud-aws-discovery-network-host
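One quick way to confirm this from the logs is to look at the address the node publishes at startup. The following is a minimal Python sketch, not part of the plugin: the function name `publishes_on_loopback` is made up for illustration, and it assumes the IPv4 log format shown in the question (`publish_address {host/ip:port}` or `publish_address {ip:port}`).

```python
import ipaddress
import re

# Matches the address inside "publish_address {...}", with or without a
# "hostname/" prefix, e.g. {localhost/127.0.0.1:9300} or {172.31.23.200:9300}.
# Assumption: IPv4 addresses only, as in the logs quoted in this thread.
_PUBLISH_RE = re.compile(
    r"publish_address \{(?:[^/}]*/)?(\d+\.\d+\.\d+\.\d+):\d+\}"
)


def publishes_on_loopback(log_line):
    """Return True if the line advertises a loopback address, False if it
    advertises any other address, and None if no publish_address is found."""
    match = _PUBLISH_RE.search(log_line)
    if match is None:
        return None
    return ipaddress.ip_address(match.group(1)).is_loopback
```

A node that publishes 127.0.0.1 on port 9300 is advertising an address other instances cannot reach, which fits the 30s discovery timeout in the first log.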

Thanks. I updated the elasticsearch.yml on the node to:

cluster.name: elasticsearch
node.name: Supergirl
node.client: "true"
network.host: _ec2_
cloud.aws.access_key: key
cloud.aws.secret_key: secret
cloud.aws.region: eu-west-1
discovery.type: ec2
discovery.ec2.tag.elasticsearch: Ubuntu-ElasticNode
discovery.zen.minimum_master_nodes: 2

Then I updated the elasticsearch.yml on the master to:

plugin.mandatory: cloud-aws
cluster.name: elasticsearch
network.host: _ec2_
cloud.aws.access_key: key
cloud.aws.secret_key: secret
cloud.aws.region: eu-west-1
discovery.type: ec2
discovery.ec2.tag.elasticsearch: Ubuntu-ElasticNode

The node joined the master, but then I noticed that nginx, Kibana and Logstash were broken on the master. They had references to localhost, which I updated to the master's EC2 IP in the nginx default file, kibana.yml and the Logstash output config file. Now I am getting this error when trying to join the Elasticsearch node to the master:

[2016-03-18 00:48:54,378][INFO ][discovery.ec2            ] [Supergirl] failed to send join request to master [{In-Betweener}.........
........
java.lang.IllegalStateException: Message not fully read (request) for requestId [4420], action [internal:discovery/zen/join/validate], readerIndex [44918] vs expected [54082]; resetting

You probably have a Node or Java client that is not using the same Elasticsearch version.

What are your LS version and config file?

I suspected that I had different versions of Elasticsearch, because I remember holding the master at 2.2.0 while the new node is running 2.2.1.

My Logstash version is:

 logstash:
  Installed: 1:2.2.2-1 

The config file looks like this:

/etc/logstash/conf.d/30-elasticsearch-output.conf

output {
  elasticsearch {
    hosts => ["172.31.31.4:9200"]
    sniffing => true
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}

So LS is fine here. You are using the HTTP connector which is perfect.

Mixing 2.2.0 with 2.2.1 should not be an issue, and the same goes for any mix of 2.x versions.

Unsure where this is coming from...

Interesting, not sure if this will help anyone else but this is the stack trace on the node:

[2016-03-18 10:20:36,078][INFO ][discovery.ec2            ] [Supergirl] failed to send join request to master      [{George Washington Bridge}{txAxO29VSoiIMu0VKmvd4g}{172.31.31.4}{172.31.31.4:9300}], reason     [RemoteTransportException[[George Washington Bridge][172.31.31.4:9300][internal:discovery/zen/join]];     nested: IllegalStateException[failure when sending a validation request to node]; nested:   RemoteTransportException[[Supergirl][172.31.28.115:9300][internal:discovery/zen/join/validate]]; nested:   IllegalArgumentException[No custom metadata prototype registered for type [licenses], node like missing  plugins]; ]
[2016-03-18 10:20:39,099][WARN ][transport.netty          ] [Supergirl] exception caught on transport layer [[id:  0x3e11e0a0, /172.31.31.4:45449 => /172.31.28.115:9300]], closing connection
java.lang.IllegalStateException: Message not fully read (request) for requestId [5953], action    [ internal:discovery/zen/join/validate], readerIndex [45509] vs expected [54685]; resetting
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:120)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:310)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:75)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

This is the output showing in the elasticsearch log on the master:

[2016-03-18 10:22:55,932][WARN ][discovery.ec2            ] [George Washington Bridge] failed to validate incoming join request from node [{Supergirl}{hrse7OHMTK-aAJGl3CX8Ew}{172.31.28.115}{172.31.28.115:9300}{client=true, data=false}]

I might try spinning up a new node in AWS later to see if a clean install helps.

Which plugins did you install? Are you missing one?

On the master I have cloud-aws and marvel, and on the node just cloud-aws.

Why don't you have marvel on the data nodes?

I think that at the very least you need to install the license plugin on each node.
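A rough way to spot such a mismatch is to compare the `plugins [...]` list each node prints at startup. This is only a sketch in Python under the assumption that both nodes log the startup line in the format seen earlier in this thread; the helper names `plugins_from_log` and `missing_plugins` are hypothetical.

```python
import re

# Matches the "plugins [a, b, c]" part of the startup line. The logger tag
# "[plugins      ]" and the "modules [...]" list are not matched, because the
# pattern requires the word "plugins" followed by exactly one space and "[".
_PLUGINS_RE = re.compile(r"\bplugins \[([^\]]*)\]")


def plugins_from_log(log_line):
    """Return the set of plugin names listed on a startup log line."""
    match = _PLUGINS_RE.search(log_line)
    if match is None:
        return set()
    return {name.strip() for name in match.group(1).split(",") if name.strip()}


def missing_plugins(reference_line, node_line):
    """Return plugins listed on the reference node but absent on the other."""
    return plugins_from_log(reference_line) - plugins_from_log(node_line)
```

Passing the master's startup line as `reference_line` and the joining node's line as `node_line` returns the names that would need to be installed on the joining node.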

I just haven't got to the marvel plugin yet. I wanted to get cloud-aws working first.

I don't remember installing a license plugin on the master, but I can see it there. What is the license plugin's purpose and where do I get it? I seem to get directed to a Shield plugin when I google it.

Okay, I found the details on the license plugin in the Marvel docs and installed both, but I am not sure if they will need any further configuration.

So I built a new node and this time installed Java 7 (instead of downgrading from 8 to 7 as per the last node), installed marvel-agent, license and cloud-aws and this was the result:

[2016-03-18 11:41:45,410][INFO ][node                     ] [Supergirl Returns] version[2.2.0], pid[1265], build[8ff36d1/2016-01-27T13:32:39Z]
[2016-03-18 11:41:45,411][INFO ][node                     ] [Supergirl Returns] initializing ...
[2016-03-18 11:41:46,738][INFO ][plugins                  ] [Supergirl Returns] modules [lang-groovy, lang-expression], plugins [marvel-agent, cloud-aws, license], sites []
[2016-03-18 11:41:49,807][INFO ][node                     ] [Supergirl Returns] initialized
[2016-03-18 11:41:49,807][INFO ][node                     ] [Supergirl Returns] starting ...
[2016-03-18 11:41:49,910][INFO ][transport                ] [Supergirl Returns] publish_address {172.31.23.200:9300}, bound_addresses {172.31.23.200:9300}
[2016-03-18 11:41:49,919][INFO ][discovery                ] [Supergirl Returns] elasticsearch/XQz2-7mlQYK64RWp5fGrcA
[2016-03-18 11:41:53,957][INFO ][cluster.service          ] [Supergirl Returns] detected_master {George Washington Bridge}{txAxO29VSoiIMu0VKmvd4g}{172.31.31.4}{172.31.31.4:9300}, added {{George Washington Bridge}{txAxO29VSoiIMu0VKmvd4g}{172.31.31.4}{172.31.31.4:9300},}, reason: zen-disco-receive(from master  [{George Washington Bridge}{txAxO29VSoiIMu0VKmvd4g}{172.31.31.4}{172.31.31.4:9300}])
[2016-03-18 11:41:53,999][INFO ][license.plugin.core      ] [Supergirl Returns] license [d07edc1f-a44b-4201-b5b6-5d377d397c4c] - valid
[2016-03-18 11:41:54,001][ERROR][license.plugin.core      ] [Supergirl Returns]
#
# License will expire on [Monday, April 11, 2016]. If you have a new license, please update it.
# Otherwise, please reach out to your support contact.
#
# Commercial plugins operate with reduced functionality on license expiration:
# - marvel
#  - The agent will stop collecting cluster and indices metrics
[2016-03-18 11:41:54,008][INFO ][http                     ] [Supergirl Returns] publish_address {172.31.23.200:9200}, bound_addresses {172.31.23.200:9200}
[2016-03-18 11:41:54,009][INFO ][node                     ] [Supergirl Returns] started

So 'Supergirl Returns' has joined the cluster, and I can see this from both Marvel and the curl cluster health output. I don't see any replication yet, but I need to go out, so I will check the health when I get back.
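For the follow-up check, the fields of the `_cluster/health` response can be read along these lines. This is a hedged Python sketch: `summarize_health` is a made-up helper that takes the already parsed JSON as a dict. One assumption worth stating: if the new node still runs with `node.client: "true"` as in the earlier config (the final config is not shown), it holds no data, so no shards would be replicated to it.

```python
def summarize_health(health):
    """Return short warnings derived from a parsed _cluster/health response."""
    warnings = []
    status = health.get("status")
    if status != "green":
        warnings.append("cluster status is %s" % status)
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        warnings.append("%d shards are unassigned" % unassigned)
    # A replica is never placed on the same node as its primary, so a
    # single data node leaves every replica unassigned.
    if health.get("number_of_data_nodes", 0) < 2:
        warnings.append("fewer than 2 data nodes, so replicas cannot be allocated")
    return warnings
```

An empty list means none of these three conditions applies; it does not prove that replication has finished.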

Thank you for your invaluable input and assistance @dadoonet.