Stress-Free Guide to Expanding a Cluster

Earlier this week we discovered that our three-node Elasticsearch
cluster needed to be expanded because it was getting dangerously close
to maximum capacity. I was nervous about this and read up as best I
could on best practices for doing it. The only advice I could find was
to make sure the new nodes cannot be elected as masters when they
join, to avoid a split-brain scenario. Fair enough.
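
For reference, this is roughly what I put in elasticsearch.yml on the
two new nodes (we're on 1.x with zen discovery; the cluster name here
is made up):

# elasticsearch.yml on the two new, data-only nodes
cluster.name: our-cluster   # hypothetical name; must match the existing nodes
node.master: false          # never eligible to be elected master
node.data: true             # hold and serve shards

# and on the three original, master-eligible nodes (quorum of 3 masters):
discovery.zen.minimum_master_nodes: 2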

I launched two new EC2 instances to join the cluster and watched. Some
shards began relocating, no big deal. Six hours later I checked in and
some shards were still relocating and one shard was recovering. Weird,
but whatever... the cluster health was still green and searches were
working fine. Then I got an alert at 2:30am that the cluster state was
now yellow, and found that we had 3 shards marked as recovering and 2
shards unassigned. The cluster still technically works, but 24 hours
after the new nodes were added I feel like my only choice to get a
green cluster again will be to simply launch 5 fresh nodes and replay
all the data from backups into them. Ugggggh.
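
For what it's worth, this is how I have been watching it (assuming the
default HTTP port 9200 on one of the nodes):

curl -s 'localhost:9200/_cluster/health?pretty'
# status plus counts of relocating / initializing / unassigned shards
curl -s 'localhost:9200/_cat/shards?v' | grep -v STARTED
# lists only the shards that are not in the STARTED state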

SERIOUSLY! What can I do to prevent this? I feel like I am missing
something, because I have always heard that the strength of
Elasticsearch is its ease of scaling out, but it feels like it falls
to the floor every time I try. :frowning:

Thanks!
James


On Wed, Jun 25, 2014 at 8:05 AM, James Carr <james.r.carr@gmail.com> wrote:

> I launched two new EC2 instances to join the cluster and watched.
> Some shards began relocating, no big deal. Six hours later I checked
> in and some shards were still relocating and one shard was
> recovering. Weird, but whatever... the cluster health was still
> green and searches were working fine.

I add new nodes every once in a while and it can take a few hours for
everything to balance out; six hours is a bit long, but possible. Do
you have graphs of the count of relocating shards? Something like that
can really help you figure out whether everything balanced out at some
point and then became unbalanced again. Example:
http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Elasticsearch%20cluster%20eqiad&h=elastic1001.eqiad.wmnet&r=hour&z=default&jr=&js=&st=1403698335&v=0&m=es_relocating_shards&vl=shards&ti=es_relocating_shards&z=large
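
If you don't have Ganglia or similar wired up, the number being
graphed there is just the relocating_shards field from cluster health,
so it's cheap to poll by hand (assuming the default port):

while true; do
  # print a timestamp and the current number of relocating shards
  date
  curl -s 'localhost:9200/_cluster/health?pretty' | grep relocating_shards
  sleep 60
done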

> Then I got an alert at 2:30am that the cluster state was now yellow,
> and found that we had 3 shards marked as recovering and 2 shards
> unassigned. The cluster still technically works, but 24 hours after
> the new nodes were added I feel like my only choice to get a green
> cluster again will be to simply launch 5 fresh nodes and replay all
> the data from backups into them. Ugggggh.

This sounds like one of the nodes bounced. It can take a long time to
recover from that; it's something that is being worked on. Check the
logs and see if you see anything about it.
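
You can also watch the recoveries themselves through the _cat API
(this exists in 1.x; assuming the default port):

curl -s 'localhost:9200/_cat/recovery?v'
# one row per shard recovery, including the stage and the percentage
# of files and bytes copied so far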

One thing to make sure of is that you set the minimum number of master
nodes (discovery.zen.minimum_master_nodes) correctly on all nodes. If
you have five master-eligible nodes then set it to 3. If the two new
nodes aren't master eligible (so you still have three master-eligible
nodes) then set it to 2.
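
It's a dynamic setting in 1.x, so you can also fix it on the running
cluster without restarting anything; a sketch, assuming the default
port:

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": { "discovery.zen.minimum_master_nodes": 2 }
}'
# use 2 for three master-eligible nodes, 3 if all five are eligible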

> SERIOUSLY! What can I do to prevent this? I feel like I am missing
> something, because I have always heard that the strength of
> Elasticsearch is its ease of scaling out, but it feels like it falls
> to the floor every time I try. :frowning:

It's always been pretty painless for me. I did have trouble when I
added nodes that were broken: one time I added nodes without SSDs to a
cluster with SSDs. Another time I didn't set the heap size on the new
nodes, and they worked until some shards moved to them; then they fell
over.
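
For what it's worth, with the standard packages the heap is set
through the ES_HEAP_SIZE environment variable before the node starts;
the file and the size below are just an example:

# /etc/default/elasticsearch (Debian) or /etc/sysconfig/elasticsearch (RPM)
ES_HEAP_SIZE=8g   # example value: roughly half the machine's RAM, under ~32g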

Nik


Try setting "indices.recovery.max_bytes_per_sec" much higher for
faster recovery. The default is 20mb/s, and there's a bug in versions
prior to 1.2 that rate-limits recovery to even less than that. You
didn't say how big your indices are, but with that number you can
fairly accurately predict how long it will take for the cluster to go
green.
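
It's a dynamic cluster setting, so you can raise it on the live
cluster while it catches up and lower it again afterwards (a sketch,
assuming the default port; the value is just an example):

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "indices.recovery.max_bytes_per_sec": "100mb" }
}'
# transient settings are forgotten on a full cluster restart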

mike
