EC2 VPC Discovery Problems

Hi Guys,

I've recently spun up a new ES cluster (2 nodes) to replace my old stack.
The new cluster is running the newest ES version (0.20.4) and the newest
AWS plugin (1.10.0), with S3 as its gateway. I cannot get the two nodes to
talk to each other and it's driving me crazy. I'm using two Ubuntu 12.04
EC2 instances inside a VPC. The nodes are in the same region, availability
zone and subnet, and the security groups for both the load balancer and the
EC2 instances have ports 9200 and 9300 open to the world.

Here is a copy of their config files:

------- NODE 1 -------

################################### Cluster ###################################

cluster.name: wtr_search_production
node.name: wtr-search-watchtower-i-70776729

#################################### Index ####################################

index.number_of_shards: 6
index.number_of_replicas: 1

#################################### Paths ####################################

path.data: /usr/local/var/data/elasticsearch
path.logs: /usr/local/var/log/elasticsearch

################################### Memory ####################################

bootstrap.mlockall: true

################################## Discovery ##################################
discovery.type: ec2
discovery.ec2.availability_zones: us-west-1a
discovery.ec2.ping_timeout: 30s
discovery.ec2.tag: wtrsearch_production

cloud.node.auto_attributes: true
cloud.aws.region: us-west-1
cloud.aws.access_key: [omitted]
cloud.aws.secret_key: [omitted]

################################## Persistence ################################
gateway.type: s3
gateway.s3.bucket: wtr-search-data-us-west-1-test

------- NODE 2 -------

################################### Cluster ###################################

cluster.name: wtr_search_production
node.name: wtr-search-watchtower-i-e27666bb

#################################### Index ####################################

index.number_of_shards: 6
index.number_of_replicas: 1

#################################### Paths ####################################

path.data: /usr/local/var/data/elasticsearch
path.logs: /usr/local/var/log/elasticsearch

################################### Memory ####################################

bootstrap.mlockall: true

################################## Discovery ##################################
discovery.type: ec2
discovery.ec2.availability_zones: us-west-1a
discovery.ec2.ping_timeout: 30s
discovery.ec2.tag: wtrsearch_production

cloud.node.auto_attributes: true
cloud.aws.region: us-west-1
cloud.aws.access_key: [omitted]
cloud.aws.secret_key: [omitted]

################################## Persistence ################################
gateway.type: s3
gateway.s3.bucket: wtr-search-data-us-west-1-test


Can anyone think of any reason why these two wouldn't be able to talk to
one another? I'm absolutely pulling my hair out.

They both have cluster health outputs that look like this:

{
  "cluster_name" : "wtr_search_production",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

They both have cluster state outputs that look like this:

{
  "cluster_name" : "wtr_search_production",
  "master_node" : "HPscfXXMQNqSIuXA3b6MLQ",
  "blocks" : { },
  "nodes" : {
    "HPscfXXMQNqSIuXA3b6MLQ" : {
      "name" : "wtr-search-watchtower-i-70776729",
      "transport_address" : "inet[/10.200.60.18:9300]",
      "attributes" : {
        "aws_availability_zone" : "us-west-1a"
      }
    }
  },
  "metadata" : {
    "templates" : { },
    "indices" : { }
  },
  "routing_table" : {
    "indices" : { }
  },
  "routing_nodes" : {
    "unassigned" : [ ],
    "nodes" : {
      "HPscfXXMQNqSIuXA3b6MLQ" : [ ]
    }
  },
  "allocations" : [ ]
}

Except with (obviously) different addresses, names and master nodes.

Please help!

Thanks!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I'm assuming that you checked all the basics:

  • security group
  • network ACLs
  • local firewall on the box
  • telnetting to port 9300 from one box to another?

I'm having a similar issue right now and everything looks solid except for
the telnet part. Can't seem to get these boxes to talk to each other.

(In reply to Nicholas Haggmark, Thursday, January 31, 2013 5:53:55 PM UTC-8)

Correct.

I'm able to curl each node from the other on 9200 and telnet from each to
the other on 9300. For the purposes of experimentation, I've opened 9200 and
9300 to any IP on both the ELB and the instance security group. I can hit
the load balancer and it round-robins me between the two nodes; they just
can't seem to find one another.

I would consider editing the logging.yml file on each node and setting the
'discovery' and 'gateway' loggers to DEBUG. You may find that it is trying
the other node but 'discarding' it because of some specific logic or
attribute value that isn't yet clear.
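A minimal sketch of that logging.yml change, assuming the `logger:` section layout of the file that ships with 0.20.x (compare it with your own copy before editing). The logger names `discovery` and `gateway` are the ones suggested above; TRACE is the level that produced the per-instance `discovery.ec2` and `discovery.zen.ping.unicast` lines quoted later in this thread.

```yaml
# logging.yml (fragment): raise verbosity for discovery and the gateway
logger:
  # EC2 instance listing, unicast pings and join decisions;
  # use TRACE instead of DEBUG to see every instance added and pinged
  discovery: DEBUG
  # gateway (here: S3) recovery activity
  gateway: DEBUG
```

As far as I know, 0.20.x reads logging.yml only at startup, so each node needs a restart to pick up the change.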

For example, the plugin's 'host_type' property (which defaults to
private_ip) may be interfering with your VPC setup; either way, the extra
detail logged by the discovery modules may give you insight into what is
and isn't happening.
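For reference, `host_type` is set in elasticsearch.yml next to the other `discovery.ec2` settings. A sketch that pins it explicitly; the alternative values in the comment are the ones listed in the cloud-aws plugin README, and which one suits this VPC is something to verify rather than assume (instances in a private subnet may have no public address at all).

```yaml
# elasticsearch.yml (fragment): which EC2 address other nodes are pinged on.
# private_ip is the default; the plugin also accepts
# public_ip, private_dns and public_dns.
discovery.ec2.host_type: private_ip
```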

Paul

(In reply to Nicholas Haggmark <nhaggmark@gmail.com>, 1 February 2013 13:41)

Well, it appears to be finding it:

[2013-02-01 03:37:56,112][TRACE][discovery.ec2 ] [wtr-search-watchtower-i-e27666bb] adding i-70776729, address 10.200.60.18, transport_address inet[/10.200.60.18:9300]

Then it adds it to a list of dynamic discovery nodes (it's in the list, but
the list is really long):

[2013-02-01 03:37:56,116][DEBUG][discovery.ec2 ] [wtr-search-watchtower-i-e27666bb] using dynamic discovery nodes [[#cloud-i-0e21a948-0][inet[/10.178.30.112:9300]]...

Then it disconnects...

[2013-02-01 03:38:42,542][TRACE][discovery.zen.ping.unicast] [wtr-search-watchtower-i-e27666bb] [1] disconnecting from [#cloud-i-70776729-0][inet[/10.200.60.18:9300]]

I wonder what is happening at this point.

If the other node doesn't show up among the servers logged like this:

[2013-02-01 03:54:26,288][TRACE][discovery.zen.ping.unicast] [wtr-search-watchtower-i-e27666bb] [1] connecting (light) to [#cloud-i-0e21a948-0][inet[/10.178.30.112:9300]]

does that mean it's somehow getting filtered out and ES isn't trying to
connect to it?

Out of curiosity, is there a limit to the number of servers it will attempt
to contact? In the (long) list of servers being pulled back, the new node is
dead last. I'm wondering whether it simply never gets around to connecting
to it.

Just triple-checking: you have marked these two EC2 instances with the
'wtrsearch_production' tag, via the console or as part of the launch API
call, right? You've set the filter by tag, so maybe they do see each other
but discount each other because the other doesn't carry that tag (perhaps
neither does). I just want to confirm that, from EC2's point of view, each
of those instance IDs really has that tag set.
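Since Nick's stacks are built with CloudFormation, one way to be sure the tag exists on the instances is to declare it in the template itself. A sketch in CloudFormation's YAML template syntax (the JSON form takes the same keys). The logical name `SearchNode1`, the tag key `cluster` and the `ami-`/`subnet-` placeholders are hypothetical; only the tag value comes from the configs in this thread.

```yaml
Resources:
  SearchNode1:                    # hypothetical logical name
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-xxxxxxxx       # placeholder: the Ubuntu 12.04 AMI in use
      SubnetId: subnet-xxxxxxxx   # placeholder: the shared VPC subnet
      Tags:
        # EC2 tag the discovery filter can match (key is hypothetical)
        - Key: cluster
          Value: wtrsearch_production
```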

Paul

(In reply to Nicholas Haggmark <nhaggmark@gmail.com>, 1 February 2013 15:03)

Sure did.

I got it to work. I manually went to both nodes, added their current
security group as a filter criterion, and restarted. This worked
immediately. It would seem that it only allows a certain number of async
connection attempts (I think?). This is going to be problematic for me,
since I'm using CloudFormation and Chef to build out and bootstrap entire
stacks. It's possible that I used tags incorrectly before, so I'll try them
instead. Otherwise, I'll have to hit the AWS API from Ruby during the Chef
deploy to find, by name, the security group that was generated for the
instances in the stack. Either way should work.
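A sketch of the security-group filter described above, assuming the cloud-aws plugin's `discovery.ec2.groups` setting, which takes a comma-separated list of security groups. The group name is a hypothetical stand-in for the one CloudFormation generates for the stack.

```yaml
# elasticsearch.yml (fragment): only consider instances in these groups
discovery.ec2.groups: wtr-search-sg-example   # hypothetical group name
# true (the default): an instance in ANY listed group qualifies;
# false: it must be in ALL of them
discovery.ec2.any_group: true
```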

Thanks!

Nick

Oh, you know what? I was probably misusing the tags, and that was probably
part of my problem. I thought the tags were specific to the AWS discovery
plugin and were matched against other ES nodes with the same value in their
config. I didn't realize they were actually matched against EC2 tags.
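For the record, the cloud-aws plugin README expresses tag filters as `discovery.ec2.tag.<key>: <value>`, matched against the instances' EC2 tags. The bare `discovery.ec2.tag: wtrsearch_production` in the configs above does not follow that form, which would be consistent with the very long instance list in the logs. A sketch assuming the instances carry an EC2 tag with the hypothetical key `cluster`:

```yaml
# elasticsearch.yml (fragment): only ping instances whose EC2 tag
# "cluster" (hypothetical key) has the value wtrsearch_production
discovery.ec2.tag.cluster: wtrsearch_production
```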

Thanks again!

Nick

(In reply to Nicholas Haggmark, Thursday, January 31, 2013 8:42:13 PM UTC-8)

Yeah, it's definitely using EC2 tags. This is a way for ES to quickly
narrow the instance list down to the ones worth checking: the EC2 API can
quickly return just the instances with that tag. Imagine a very large
Hadoop cluster working alongside a large-ish ES cluster. The ES nodes only
want to scan the list of ES instances, filtering out the Hadoop ones rather
than checking whether each of those hosts is up.
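To make the Hadoop example concrete, a sketch assuming ES instances are tagged `role=elasticsearch`, Hadoop instances `role=hadoop`, and both carry `stage=production` (all hypothetical). The plugin README says that when several `discovery.ec2.tag.*` settings are given, an instance must carry all of them, so only the ES instances come back from EC2 to be pinged.

```yaml
# elasticsearch.yml (fragment): both tags must match (keys/values hypothetical)
discovery.ec2.tag.stage: production
discovery.ec2.tag.role: elasticsearch
```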

I didn't think the security group filter was required; it's just another,
similar way of matching.

Glad it worked out for you. Good luck with your ES project!

Paul

(In reply to Nicholas Haggmark <nhaggmark@gmail.com>, 1 February 2013 15:45)

Many thanks!

(In reply to tallpsmith, Thursday, January 31, 2013 8:47:52 PM UTC-8)
