Cluster unstable when recovering a node

I keep running into stability issues on the cluster when rebuilding one of
the nodes; the problems occur while the shards are rebalancing.

In more detail: I have a cluster with 3 nodes. The data is time-based, so
indexes are created one per month, with 1 shard and 1 replica per index,
meaning each node holds about 2/3 of the indexes.
This setup works great in general. Problems start when one of the nodes
goes down.
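
For reference, each monthly index is created with something like this
(the index name and host are just examples):

    curl -XPUT 'http://localhost:9200/logs-2012-06' -d '{
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }'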

For instance, yesterday Amazon had issues in one AZ, which brought one of
the nodes down for several hours, and I had to build a replacement node
from backup. The new node was added to the cluster, and the cluster began
rebalancing the shards.
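
For anyone seeing the same thing, the rebalancing progress is visible
through the cluster health API, which reports the relocating and
initializing shard counts (host is just an example):

    curl 'http://localhost:9200/_cluster/health?pretty=true'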

Now, while the rebalancing was going on, cluster performance degraded
severely. Indexing times went up (and indexing even came to a halt
occasionally) and search times went up as well, which had a bad impact on
our application. At some point another node stopped responding and was
removed from the cluster, and I needed to restart that node too, which
meant more rebalancing. So bringing the cluster back to a smooth-running
state takes a long couple of hours, during which performance is somewhere
between bad and terrible.

Is there any way to handle this issue? Am I missing something in the
cluster configuration that can prevent these problems?

Forgot to mention, I'm using 0.19.2


On Sat, 2012-06-30 at 12:48 -0700, Rotem wrote:

Forgot to mention, I'm using 0.19.2

Yeah, copying large amounts of data causes a lot of I/O, which can
degrade performance to the point of not being usable.

Version 0.19.5 comes with throttling, which gives you more control over
how fast data is copied over.
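
If I remember right, the relevant knobs are the store-level throttle
settings, which can also be changed on a live cluster through the cluster
update settings API. Treat the setting names and values below as a sketch
to double-check against the 0.19.5 docs:

    # Example values -- tune max_bytes_per_sec to your disks
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "persistent": {
        "indices.store.throttle.type": "all",
        "indices.store.throttle.max_bytes_per_sec": "20mb"
      }
    }'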

Obviously, if you throttle, it'll take longer to recover, but if you
don't, your cluster may become unusable while you're recovering :)

clint
