EC2 discovery config options?


(Matt Paul-2) #1

All,

I've looked through the docs and the user group and can't seem to find
much on this, so forgive me if I missed it. I've got a small cluster
set up (2 nodes) on AWS and I'm using EC2 discovery for the cluster.
All appears to be working fine, but it seems to take a VERY long time
for the cluster health to go green (up to 20 minutes) from a shutdown
and restart of elasticsearch on both nodes. I've tried looking to see
if I can find anything on timeouts and retries (other than the initial
state timeout), but haven't found much. My config on both nodes looks
like (key values intentionally omitted):

Cluster Settings

cluster:
  name: mattdev

cloud:
  aws:
    access_key:
    secret_key:

discovery:
  type: ec2

gateway:
  type: s3
  s3:
    bucket: elasticsearch.state.matt
  expected_nodes: 2

Like I said, everything seems to work fine; it just takes a really long
time for the cluster to go green. Any help or ideas?

thanks


(Shay Banon) #2

When you shut down the cluster, do you start it on fresh instances? If so, there is no local data that can be reused and the whole index needs to be downloaded from S3, which might take time. The index status API gives you some info on the progress of the recovery, so you can verify.

Another option is to run ES with the (default) local gateway, either on EBS (for extra persistence) or on the local instance file system (but possibly with more replicas).


(Matt Paul-2) #3

Shay,

I'm restarting it on the same instances. All I am doing is issuing a
_shutdown with curl (curl -XPOST "http://localhost:9200/_shutdown"),
waiting for it all to stop (basically instantly), then starting the
elasticsearch script again on the same instance. From that point, it
has taken as long as 20 minutes to get a green status.


(Shay Banon) #4

And how quickly do you get to yellow status? Basically, a primary shard and its replica might drift apart in their index files (but not in content), and might require a resync.


(Matt Paul-2) #5

It takes about 5 1/2 - 6 minutes (pretty consistently) for one of the
nodes to become the master. Once that happens, checking the status on
that node shows yellow. Until the other node joins the cluster, I just
get the "MasterNotDiscovered" error when trying to check status, etc.
on it.


(Shay Banon) #6

Mmm, that should not take this long. Maybe the describe instances API call on EC2 is taking a long time (which would be very strange). Which version are you running?

Can you set discovery to TRACE logging in the logging.yml file (similar to how action is set there) and gist the logs for both nodes? We can try and derive the timings from it.
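
For reference, a sketch of what that logging.yml change might look like. It assumes the stock logging.yml layout, where an `action` entry already sits under `logger:`; only the `discovery` line is new, and the rest of your file may differ.

```yaml
# logging.yml (fragment); assumes the default file shipped with Elasticsearch
logger:
  # existing entry in the stock file
  action: DEBUG
  # added: trace-level output for the discovery module, which covers EC2 discovery
  discovery: TRACE
```

Trace output is verbose, so it is probably worth removing the line again once the timings have been captured.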

Also, the indices status API should give you a report of how long the primary shards took to recover from S3 (and whether data was reused from the local file system), and how long each replica shard took to sync its state with the primary shard.

Last, I just noticed that your config does not set recover_after_nodes. Can you set it to 2 and see? The expected_nodes setting has no effect unless recover_after_nodes is set (and possibly recover_after_time). Setting it makes sure recovery waits for both nodes, so the best deployment scenario for reusing local node data is chosen.
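
A minimal sketch of the gateway section with that change applied, based on the config posted at the top of the thread. The `recover_after_time` line and its 5m value are illustrative assumptions, not something recommended in this thread.

```yaml
gateway:
  type: s3
  s3:
    bucket: elasticsearch.state.matt
  # do not start recovery until both nodes have joined the cluster
  recover_after_nodes: 2
  # once this many nodes are present, recover right away
  expected_nodes: 2
  # optional extra wait after recover_after_nodes is met; 5m is an assumed example
  recover_after_time: 5m
```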

-shay.banon


(Matt Paul-2) #7

Shay,

After following your advice and looking through the trace, I found what
the issue was, thanks! Apparently when the list of EC2 nodes in the
security group is retrieved, it then attempts to connect (in a blocking
fashion, possibly?) to each one in turn. It turns out that the
Elasticsearch nodes I'm using are the first and the last nodes in
a list of about 20 instances, so it took quite a while for them to
find each other. Once I added specific tags to just the Elasticsearch
EC2 instances and changed the ES config to search just for those
tags, the cluster comes up in seconds.
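
For anyone finding this later, the tag filter looks roughly like the fragment below. The tag name `role` and value `elasticsearch` are hypothetical placeholders, so substitute whatever tag you put on your own instances. This assumes the `discovery.ec2.tag.*` setting of the AWS cloud plugin.

```yaml
discovery:
  type: ec2
  ec2:
    tag:
      # hypothetical tag key and value; only instances carrying this tag are contacted
      role: elasticsearch
```

With a filter like this, the other instances in the security group are never contacted, so the nodes no longer have to wait on connection attempts to machines that are not running Elasticsearch.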

Thanks for all the help

Matt


(Shay Banon) #8

Heya, great that you improved that! I just pushed improvements to unicast discovery (and the EC2 discovery) to do the connection attempts in parallel (when needed), so it will be a bit speedier even in the case you mentioned.


(Matt Paul-2) #9

Shay,

Thanks! I certainly didn't expect it, much appreciated!

Matt

