ES cluster recovery

Hello,

I have a question about cluster recovery after the cluster goes into an
unhealthy state.
Let's assume the following.

We have a cluster with 9 nodes.
3 master nodes (esmX) (master=true, data=false)
4 data nodes (esdX) (master=false, data=true)
2 client nodes (escX) (master=false, data=false)
minimum_master_nodes is set to 2.
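
In elasticsearch.yml terms the roles boil down to something like this (just
spelling out the flags from the list above):

# master nodes (esmX)
node.master: true
node.data: false

# data nodes (esdX)
node.master: false
node.data: true

# client nodes (escX)
node.master: false
node.data: false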

The cluster is deployed across multiple racks.
rack 1
esm1, esm2, esd1, esd2 and esc1

rack2
esm3, esd3, esd4 and esc2

With this configuration I can lose rack 2 and the cluster still fulfills
the requirements to form a proper cluster.
If I were to lose rack 1 forever, or for a long time, I would manually spin up
a second master node in rack 2 to satisfy the minimum of 2 master nodes.

If the network connection between the 2 racks now fails, the cluster goes
into an unhealthy state.
After a while rack 1 is working again, but I noticed that this can take many
minutes. Even after playing with the timeout settings for failure detection,
it takes relatively long until the cluster decides that the other nodes are
gone and things are back to normal.
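
The failure-detection settings I have been playing with are roughly these (I
mostly changed the timeouts; the values here are only examples):
discovery.zen.fd.ping_interval: 1s   # how often nodes are pinged
discovery.zen.fd.ping_timeout: 5s    # how long to wait for each ping
discovery.zen.fd.ping_retries: 3     # failed pings before a node is dropped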

My question is: is that normal? Do I have to live with a few minutes of
downtime if part of the cluster becomes unreachable?
Or are there any other options I could try to tune?

Thanks
Marco


Hey,

maybe I misread your mail, but I am actually not sure which part exactly is
taking so much time. Is it the pinging of nodes to finally remove part of
the cluster? Is it the recovery and copying of data in order to recreate a
working cluster?
Also, since you have different racks, you could use rack-based shard
allocation awareness (I guess you already do) to make sure all your data is
still available in case a rack fails.
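Something along these lines, where the attribute name and value are just an
example:
node.rack_id: rack1
cluster.routing.allocation.awareness.attributes: rack_id
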
And what is an unhealthy state exactly?

Interested in getting a bit more information here.

--Alex

On Fri, Jan 17, 2014 at 2:54 PM, Marco Schirrmeister <
mschirrmeister@gmail.com> wrote:



Interesting, my last post was deleted. I will try it again.

On Monday, January 20, 2014 10:26:37 AM UTC+1, Alexander Reelsen wrote:

maybe I misread your mail, but I am actually not sure which part exactly is
taking so much time. Is it the pinging of nodes to finally remove part of
the cluster? Is it the recovery and copying of data in order to recreate a
working cluster?

I'm not sure what it all does under the hood, but I assume it's only pinging
the nodes and removing them from the cluster.
I don't think it needs to copy data, since one rack has all the data and it
only needs to promote a replica shard to primary.

Also, since you have different racks, you could use rack-based shard
allocation awareness (I guess you already do) to make sure all your data is
still available in case a rack fails.

We are already doing this. At least one replica of each shard is always on a
node in a different rack.
For example:
cluster.routing.allocation.awareness.force.zone.values: rack1,rack2
cluster.routing.allocation.awareness.attributes: ms_rack
node.ms_rack: rack1

And what is an unhealthy state exactly?

By unhealthy I mean that the head plugin shows incorrect information.
Normally it shows, for example, "esm1 cluster health: green (9, 5)" and all
the nodes in the cluster are listed under "Cluster Overview".
When the network connection between the racks is lost, no nodes are visible
under Cluster Overview anymore, while the cluster health still shows green
with the full number of nodes and shards.
The node count then slowly drops, depending on how fast the cluster detects
and removes the dead nodes.

Here is an example from head connected to esm1 during a network loss.
9:23:24 network loss between rack1 and rack2
9:23:43 esm1 removed node esc2 (head shows no nodes and health as green 8,5)
9:26:13 esm1 removed node esd3 (head shows no nodes and health as green 7,5)
9:26:43 esm1 removed node esd4 (head shows no nodes and health as green 6,5)
9:27:13 esm1 removed node esm2 (head shows no nodes and health as yellow 5,5)

9:27:13 head shows the cluster health as yellow (5,5), which is correct
(since some replica copies are not available), and the 5 remaining nodes
appear again under Cluster Overview: 2 data, 2 master, 1 client.
The result is that reading from and writing to the cluster stalled for about
4 minutes.

When the log on esm1 shows that it removes esm2, the reason given is "failed
to ping, tried [3] times, each with maximum [5s] timeout". In my eyes that
does not match when it actually happens: 3 pings with a 5s timeout each
should add up to roughly 15 seconds, yet the node is only removed about 4
minutes after the network loss.

The version I'm testing with is 0.90.10.

cluster.name: MSES1Test1
node.name: "esm1"
node.master: true
node.data: false
path.data: /var/lib/elasticsearch/data
path.logs: /var/log/elasticsearch
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["esm1.ms.lan", "esm2.ms.lan", "esm3.ms.lan"]
cluster.routing.allocation.awareness.force.zone.values: rack1,rack2
cluster.routing.allocation.awareness.attributes: ms_rack
node.ms_rack: rack1
discovery.zen.ping_timeout: 5s
discovery.zen.fd.ping_timeout: 5s

Thanks
Marco
