Cluster availability

Carlos_Daniel_Ruvalc · September 6, 2013, 9:40pm

Hi list,

Where I work we have recently started using Elasticsearch and have worked
our way through customizing the schema, analyzers, etc. We are deploying on
AWS and are happy with it, except for one behavior I haven't quite managed
to work around.

If I have a cluster of 5 instances (5 shards, 2 replicas) and only one of
them goes down (for whatever reason, but unplanned), this usually brings
down the entire cluster for a couple of minutes. I assume it is because it
is rebalancing, but is there any way to avoid this downtime?

We have all instances behind a load balancer, so there is no single point
of failure (such as the node we use to access the cluster going down). I
tried taking down a random instance and still see this behavior, so I
cannot blame it on taking down a specific master node.

Regards,
Carlos Ruvalcaba

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

dadoonet · September 7, 2013, 12:24am

I don't think you should see 'downtime' when a node disappears, as long as you have at least 1 replica. Rebalancing shards does not "stop" the cluster.

I would like to know more about what you are seeing. Can you add more details about your ES setup?

--
David
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

Carlos_Daniel_Ruvalc · September 7, 2013, 11:34pm

Hi,

We basically piggyback on AWS Elastic Beanstalk; we already have a PHP app
serving as the Elasticsearch gatekeeper (as our search functions are very
static).

How it works: once the app is deployed to a Beanstalk instance, it checks
whether Elasticsearch is installed on the local instance. If not, it
downloads it from our own tarball (with our specific config options), sets
up permissions, installs ES as a service (via servicewrapper), and then
runs it. We have configured it to use the AWS plugin with EC2 discovery. If
it is already installed, the app checks whether there is a config update
that requires restarting the instance (which almost never happens).

Once the instance is up, the frontend app connects to it to serve search
requests. This is basically the configuration for each instance in the
Beanstalk group, which also means all of them share the same ES
configuration (master eligible, store data, 5 shards, 2 replicas, all other
defaults). Our dataset is relatively small and replicates very quickly, so
I can see the cluster behind the Beanstalk load balancer working fine.

The problem comes when autoscaling kicks in. By default it runs two
instances; when needed, it adds a new instance, which joins the ES cluster
fine. In load tests I can clearly see the improvement in requests per
second and average cluster load when it kicks in. But after the load is
gone, the autoscaling group scales the cluster back, terminating the other
instances (usually only one). It is at this moment that requests to the
remaining nodes begin to fail. The app returns a specific error that means
it could not connect to the local ES instance, so I know the request is
routed to a healthy EC2 instance and not just lost by the terminating one.
This goes on for a couple of minutes, and then ES responds fine with no
further issues. I also tried connecting to the instance and querying ES
directly (via curl). I don't have the returned string at hand, but it
basically said something like master node not discovered or found.

I'm not sure if adding a third instance (having 3 as the minimum) could
improve this (that is, if what is causing this is a split-brain kind of
behavior between the two nodes). Any thoughts?

Regards,
Carlos Ruvalcaba

Norberto_Meijome · September 9, 2013, 6:49am

The exact curl command you use and the error you get should help.

Carlos_Daniel_Ruval1 · September 18, 2013, 12:47am

It took me some time to get back to this, but here is the specific error
I'm getting:

curl -XGET 'http://localhost:9200/ev/_status?pretty=1'
{
  "error" : "ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];]",
  "status" : 503
}
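
A minimal sketch of how the gatekeeper app could tell this "no master" condition apart from other failures, assuming the error body keeps the shape shown above. The function names and the regular expression are hypothetical and only parse the string from this response; they are not part of any Elasticsearch client API.

```python
import json
import re
from typing import List, Tuple

# Matches entries such as "[SERVICE_UNAVAILABLE/2/no master]" in the error string.
BLOCK_PATTERN = re.compile(r"\[([A-Z_]+)/(\d+)/([^\]]+)\]")

# The response body from the curl call above, used here as sample input.
SAMPLE = """{
  "error" : "ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];]",
  "status" : 503
}"""


def cluster_blocks(body: str) -> List[Tuple[int, str]]:
    """Return (block id, description) pairs found in an error response body."""
    error = json.loads(body).get("error", "")
    if not isinstance(error, str):
        return []
    return [(int(num), desc.strip()) for _, num, desc in BLOCK_PATTERN.findall(error)]


def has_no_master(body: str) -> bool:
    """True if the response reports the 'no master' cluster block."""
    return any(desc == "no master" for _, desc in cluster_blocks(body))


if __name__ == "__main__":
    print(cluster_blocks(SAMPLE))
    print(has_no_master(SAMPLE))
```

With a check like this, the app could retry or return a clearer message while the cluster has no master, instead of reporting a generic connection failure.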

This is after having 2 instances up and running, then adding two more;
everything works OK and they are all in sync. Then I just kill 2 instances,
and I'm left with 2 instances returning that message.

I think that maybe the AWS plugin is still seeing the terminated instances
and trying to contact them?

Regards,
Carlos Ruvalcaba
