Cluster unstable when recovering a node

I keep running into stability issues on the cluster when rebuilding one of
the nodes; the problems occur while the shards are rebalancing.

In more detail: I have a cluster with 3 nodes. The data is time-based, so
indexes are created one per month, with 1 shard and 1 replica per index,
meaning each node holds about 2/3 of the indexes.
This setup works great in general. Problems start when one of the nodes
goes down.
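
For reference, each monthly index is created with something like this
(the index name and host are just examples):

    curl -XPUT 'http://localhost:9200/logs-2012-06' -d '{
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }'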

For instance, yesterday Amazon had issues in one AZ, which brought one of
the nodes down for several hours, and I had to build a replacement node
from backup. The new node was added to the cluster, and the cluster began
rebalancing the shards.
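
For anyone seeing the same thing, the rebalancing progress is visible
through the cluster health API, which reports the relocating and
initializing shard counts (host is just an example):

    curl 'http://localhost:9200/_cluster/health?pretty=true'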

Now, while the rebalancing was going on, cluster performance degraded
severely. Indexing times went up (and indexing even came to a halt
occasionally) and search times went up as well, which had a bad impact on
our application. At some point another node stopped responding and was
removed from the cluster, and I needed to restart that node too, which
meant more rebalancing. So bringing the cluster back to a smooth-running
state takes a long couple of hours, during which performance is somewhere
between bad and terrible.

Is there any way to handle this issue? Am I missing something in the
cluster configuration that can prevent these problems?

Forgot to mention, I'm using 0.19.2


On Sat, 2012-06-30 at 12:48 -0700, Rotem wrote:

Forgot to mention, I'm using 0.19.2

Yeah, copying large amounts of data causes a lot of I/O, which can
degrade performance to the point of not being usable.

Version 0.19.5 comes with throttling, which gives you more control over
how fast data is copied over.
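
If I remember right, the relevant knobs are the store-level throttle
settings, which can also be changed on a live cluster through the cluster
update settings API. Treat the setting names and values below as a sketch
to double-check against the 0.19.5 docs:

    # Example values -- tune max_bytes_per_sec to your disks
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "persistent": {
        "indices.store.throttle.type": "all",
        "indices.store.throttle.max_bytes_per_sec": "20mb"
      }
    }'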

Obviously, if you throttle, it'll take longer to recover, but if you
don't, your cluster may become unusable while you're recovering :)

clint
