Upgrading to 0.16.2 from 0.15.2 - EC2 discovery problem appears (previously worked)

I run a 2-node test/reference cluster on EC2, with the 2 nodes in different availability zones (us-east-1a and us-east-1b). Under 0.15.2, this worked fine.

I am currently upgrading to 0.16.2 (no other changes), and discovery now appears to work only between machines in the same availability zone.

My YML configuration is very simple (pasted below since it's so short):

cluster:
  name: infinite-aws
discovery:
  type: ec2
cloud:
  aws:
    access_key:
    secret_key:
bootstrap:
  mlockall: true

(I tried adding "ec2:availability_zones: us-east-1a,us-east-1b,us-east-1c", but this didn't make a difference - not surprisingly, since the discovery phase works)
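
For reference, the nested YAML equivalent of what I tried was roughly the following - assuming the flat key corresponds to the discovery.ec2.availability_zones setting referenced further down:

discovery:
  type: ec2
  ec2:
    availability_zones: us-east-1a,us-east-1b,us-east-1c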

Node A (us-east-1a) and Node B (us-east-1b) can telnet to each other's private IP addresses. (But they are on different Class C subnets, so any broadcasts wouldn't work.) The log files indicate that the 2 nodes find each other using the discovery.ec2 mechanism.

There are no log messages (even with discovery and transport debug enabled) between the correct list of EC2 nodes being returned (as noted above, I've confirmed by hand via telnet that I can connect to the node:port pairs listed) and the "ping responses: {none}" message, after which the node declares itself master.

The only other interesting thing in the logs is a set of "received ping response with no matching id [1]" messages across all nodes after one node declares itself master. (I saw the other thread where the problem was the nodes binding themselves to an IPv6 address, but here the transport log messages indicate the nodes are correctly binding to 0.0.0.0:9300.)

For both nodes, the log confirms they are running elasticsearch/0.16.2, and (although the log doesn't confirm versioning for the AWS plugin) I can see the line "Downloading plugin [...] cloud-aws-0.16.2.zip" on the console of both nodes.

When I started up a Node C in us-east-1b, Nodes B and C found each other as expected.

So is EC2 discovery supposed to work across availability zones? (Or was I taking advantage of an unintended feature in 0.15.2? Though the presence of discovery.ec2.availability_zones suggests not.)

If so, has anyone seen it work?

I can provide further details as needed, i.e. if there's no simple explanation or the above is too unclear for a quick diagnosis.

Update: when I started them the other way round (first Node B, then Node A), Node A connected to the cluster.

So now I have no idea what's going on....

Update update (last "spam" to this thread, honest): I deleted "Node A" (the one in the different availability zone), created another "Node D" in the same availability zone (with an identical YAML configuration), and tried to add it to the existing cluster formed by B and C, but the same problem occurred (discovery worked, no error messages, then "Node D" elected itself master of a 1-node cluster).

Any ideas much appreciated...

It might be a timing issue, where it takes longer to connect to the nodes. Try increasing the default ping timeout (which is 3 seconds), discovery.zen.ping_timeout, and set it to something like 10s.
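
In elasticsearch.yml that can be set with the flat key form, for example (the nested discovery.zen form shown in the next message is equivalent):

discovery.zen.ping_timeout: 10s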

On Wednesday, June 8, 2011 at 4:19 AM, Alex at Ikanow wrote:

Any ideas much appreciated...

Thanks for the suggestion. Is that parameter definitely right?

When I added:

discovery:
  zen:
    ping_timeout: 10s

(I also tried 10 for this param)

I still get the debug log message: "[...] discovery.ec2:71 - [NODE]
using initial_ping_timeout [3s]", suggesting that the param didn't
take?

OK, I think I have a theory as to what the problem is... I have a bunch of other nodes (8 or so) in my EC2 network which don't run ES. Depending on how long ES spends trying to connect to nodes with nothing listening on 9300 (it would have to be longer than 3s in practice, though not by much), it might hit the global 30s limit before it gets to the nodes in the network that are running ES.

(Of course increasing the ping timeout would be counter-productive in
this case)
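
To put rough numbers on it (assuming the candidate addresses are tried more or less sequentially, and that each node with nothing listening on 9300 costs a bit over the 3s ping timeout before ES gives up on it):

  8 non-ES nodes x ~4s per failed attempt = ~32s, i.e. already past the ~30s window

so the real ES nodes at the end of the list might never get pinged at all.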

This would explain why it worked differently when starting and stopping nodes in different orders (i.e. if the running node was at the start of the list the new node would connect; if it was at the end, it would fail).

Unfortunately, all my shutting down and recreating of nodes while investigating this has pushed my ES cluster to the start of the EC2 list, so it's all working fine this morning...

The solution to this candidate explanation would presumably be to increase this global value (which defaults to 30s) to something higher - is there a parameter that lets you do that? I had a quick look at the ES site under discovery but couldn't immediately see anything. Of course, you could also add EC2 tags to your ES nodes.

If this explanation makes sense, it might be worth adding a line to the EC2 discovery documentation saying that having a large number of non-ES nodes in the EC2 network can cause this problem, and what to change to alleviate it (timeout, adding tags, etc.)?

Thanks for all the help getting 0.16 up and running

Alex

On Jun 8, 5:15 am, Shay Banon shay.ba...@elasticsearch.com wrote:

It might be a timing issue, where it takes longer to connect to the nodes. Try increasing the default ping timeout (which is 3 seconds), discovery.zen.ping_timeout, and set it to something like 10s.

FWIW, I've never got that parameter to work either. I see the [3s] debug
message also.

-- jim

On Wed, Jun 8, 2011 at 9:36 AM, Alex at Ikanow apiggott@ikanow.com wrote:

Thanks for the suggestion. Is that parameter definitely right?

If the problem is caused by a large number of EC2 instances, you may be able
to reduce the number of servers ES attempts to connect to by using the
security group and/or tags filters.
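
As a sketch - the setting names here (discovery.ec2.groups and discovery.ec2.tag.*) are my reading of the cloud-aws plugin and may not exist in this plugin version, and the group name and tag name/value are just placeholders:

discovery:
  ec2:
    groups: my-es-security-group
    tag:
      es_cluster: infinite-aws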

-- jim

On Wed, Jun 8, 2011 at 9:36 AM, Alex at Ikanow apiggott@ikanow.com wrote:

Thanks for the suggestion. Is that parameter definitely right?

The main thing to solve is the concurrency aspect of connection and machines that take time to detect the fact that there is nothing to connect to. I worked on it a bit and hope to have solved it now: "Transport: Improve concurrency when connecting to several nodes" (elasticsearch issue #1007 on GitHub).

Not sure why the ping timeout is not working, downloaded a fresh install, set the discovery.zen.ping_timeout (to something like 10s) and I see it being picked up.

I agree about the note; we should add it to the documentation. I would say that the recommendation should be to use things like tags to indicate elasticsearch machines / separate clusters, so it won't have to ping too many unrelated machines (though I hope this is fixed now).

On Wednesday, June 8, 2011 at 5:10 PM, James Cook wrote:

If the problem is caused by a large number of EC2 instances, you may be able to reduce the number of servers ES attempts to connect to by using the security group and/or tags filters.

I think I see the problem with the timeout parameter. Shouldn't the name of
the parameter be "discovery.ec2.ping_timeout"?

-- jim
On Wed, Jun 8, 2011 at 12:15 PM, Shay Banon shay.banon@elasticsearch.com wrote:

Not sure why the ping timeout is not working, downloaded a fresh install,
set the discovery.zen.ping_timeout (to something like 10s) and I see it
being picked up.

Yes, you nailed it! That's the parameter that should be used.
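
In elasticsearch.yml that would look something like the following (using the 10s example value from earlier in the thread; the flat form discovery.ec2.ping_timeout: 10s is equivalent):

discovery:
  ec2:
    ping_timeout: 10s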

On Friday, June 10, 2011 at 10:01 PM, James Cook wrote:

I think I see the problem with the timeout parameter. Shouldn't the name of the parameter be "discovery.ec2.ping_timeout"?
