I'm repeatedly encountering stability issues on the cluster when
rebuilding one of the nodes; this happens while the shards are rebalancing.
In more detail: I have a cluster with 3 nodes. Data is time-based, so indexes
are created one per month, with 1 shard and 1 replica per index, meaning each
node holds about 2/3 of the indexes.
This setup works great in general. Problems start when one of the nodes
goes down.
For instance, yesterday Amazon had issues in one AZ, which brought one of
the nodes down for several hours, and I had to rebuild a new node from
backup. So the new node was added to the cluster, and the cluster began
rebalancing the shards.
Now while rebalancing was occurring, cluster performance degraded severely.
Indexing times went up (and even came to a halt occasionally), search times
went up, and our application took a bad hit. At some point another node
stopped responding and was removed from the cluster, and I had to
restart that node as well, which meant more rebalancing. So bringing the
cluster back to a smooth running state takes a long couple of hours during
which performance is somewhere between bad and terrible.
Is there any way to handle this issue? Am I missing something in the
cluster configuration that can prevent these problems?
On Saturday, June 30, 2012 10:48:25 PM UTC+3, Rotem wrote:
Yeah, copying large amounts of data causes a lot of I/O, which can
degrade performance to the point of not being usable.
Version 0.19.5 comes with throttling, which gives you more control over
how fast data is copied over.
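For example, assuming the store-level throttling settings that shipped around
0.19.5 (`indices.store.throttle.type` and
`indices.store.throttle.max_bytes_per_sec`, both dynamically updatable through
the cluster settings API), something like this would cap copy I/O; the
endpoint, the 20mb rate, and the localhost address are illustrative:

```shell
# Sketch: throttle all store-level I/O (merges and shard copies) to 20 MB/s,
# applied live via the cluster settings API -- no restart needed.
# "transient" means the setting is dropped on a full cluster restart.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.store.throttle.type": "all",
    "indices.store.throttle.max_bytes_per_sec": "20mb"
  }
}'
```

Once the cluster is green again you can raise the rate, or set the type back
to "merge" (or "none") so normal operation isn't throttled.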
Obviously, if you throttle, it'll take longer to recover, but if you
don't, your cluster may become unusable while you're recovering.
clint