Split brain, confused master: failed to send rejoin request to <host>

Nothing showed up on Google, so here we go:

Currently 3 nodes don't want to join the cluster of 15+ nodes.

One of the nodes logs:

[2012-11-20 07:17:35,623][DEBUG][discovery.zen.fd ] [es-028]
[master] starting fault detection against master
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], reason [initial_join]
[2012-11-20 07:17:36,650][DEBUG][discovery.zen.fd ] [es-028]
[master] pinging a master
[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none} but we do not exists on it, act as if its master
failure
[2012-11-20 07:17:36,651][DEBUG][discovery.zen.fd ] [es-028]
[master] stopping fault detection against master
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], reason [master failure, do not exists on
master, act as master failure]
[2012-11-20 07:17:36,652][INFO ][discovery.zen ] [es-028]
master_left
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], reason [do not exists on master, act as master
failure]

The master reports:
[2012-11-20 07:16:35,257][TRACE][discovery.zen.ping.multicast] [es-030] [1]
received ping_request from
[[es-028][33TXhg8vTR6FrLdlwjwMvw][inet[/10.32.0.135:29300]]{name_attr=es-028,
master=true, river=none}], sending ping_response{target
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], master
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], cluster_name[bc]}
[2012-11-20 07:17:05,259][TRACE][discovery.zen.ping.multicast] [es-030] [1]
received ping_request from
[[es-028][33TXhg8vTR6FrLdlwjwMvw][inet[/10.32.0.135:29300]]{name_attr=es-028,
master=true, river=none}], sending ping_response{target
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], master
[[es-030][CgNqNgzcRGOFan4QJY6gGg][inet[/10.32.0.137:29300]]{name_attr=es-030,
master=true, river=none}], cluster_name[bc]}

No network issues; telnetting from port to port works fine.

Config from one of the nodes (filters and such removed):

cluster:
name: bc

node:
name: ${SHORT_HOSTNAME}
name_attr: ${SHORT_HOSTNAME}
river: "none"
master: true

index:
number_of_shards: 29
number_of_replicas: 3

gateway.type: local
gateway.recover_after_nodes: 9
gateway.recover_after_time: 5m
gateway.expected_nodes: 10

cluster.routing.allocation.node_initial_primaries_recoveries: 10
cluster.routing.allocation.node_concurrent_recoveries: 5
cluster.routing.allocation.cluster_concurrent_rebalance: 20

discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.timeout: 60s
discovery.zen.fd.ping_timeout: 60s

transport.tcp.port: 29300
http.port: 29200

index.search.slowlog.level: TRACE
index.search.slowlog.threshold.query.warn: 60s
index.search.slowlog.threshold.query.info: 10s
index.search.slowlog.threshold.query.debug: 5s
index.search.slowlog.threshold.query.trace: 2s

index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.trace: 200ms


Already tried restarting the rogue nodes, to no avail. Any help would be
appreciated!

--

wvl wrote:

Nothing showed up on google, so here we go:

Currently 3 nodes don't want to join the cluster of 15+ nodes.

[...]

already tried restarting the rogue nodes to no avail. Any help would be
appreciated!

If you haven't done it already, stop all indexing!

We need to inspect the split brain situation more closely. You need
to ask each node who it thinks is master. Use parallel-ssh, knife,
or whatever else allows you to run a command across all of the nodes.
Here's an example using pssh/jsawk (you could use any tool you like)
from https://gist.github.com/4118487:

pssh es curl -s https://raw.github.com/gist/4118487/master.sh | sh

If you use that, change the variables to reflect your network
interface and ES HTTP port. Output will look something like this for
each node in the cluster. If you're running more than one node on
each machine you'll have to tune it even further.

me: 192.168.20.109 master: ["bYijI5_vT3CsrhdukyCUMA","Lifeforce","inet[/192.168.20.109:9300]"]
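
If pssh/jsawk aren't handy, a plain curl loop gives roughly the same picture.
This is only a sketch, not the contents of the gist; it assumes the HTTP port
from the config earlier in this thread (29200), a hypothetical host list, and
that python is available to parse the JSON:

for host in es-028 es-029 es-030; do   # substitute your own node list
  echo -n "$host -> master: "
  curl -s "http://$host:29200/_cluster/state?filter_routing_table=true&filter_metadata=true" \
    | python -c 'import json,sys; s=json.load(sys.stdin); print(s["nodes"][s["master_node"]]["name"])'
done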

That will give us a sense of how many different masters there are and
how the nodes disagree. Then we can examine logs in more detail and
figure out what to do.

-Drew

--

If you haven't done it already, stop all indexing!

When the cluster hits status red, it already stops indexing, right?

We need to inspect the split brain situation more closely

Right, I did so at the time. I saw that we had two clusters: one of 3 nodes and
another of, let's say, 10+.

I provided log entries from the master with the most nodes and from one of the
rogue nodes that refused to join the master's cluster, even when we shut down
all the other nodes in the smaller cluster and made it possible for that node
to become master itself.

So basically, whatever we did, we couldn't get the rogue node to join the larger
cluster, and the only relevant logs are the ones provided above.

--

The minimum_master_nodes setting seems eerily identical to the size of the
rogue cluster. Coincidence? I think not!

Ideally this setting should be (n/2)+1, where n is the number of master-eligible
nodes. I'm not sure of the significance of expected_nodes, but I set it to the
number of nodes. I can't say why the cluster lost communication in the first
place, but you can at least avoid split brains this way. I have also played
around with the transport timeout setting, because nodes sometimes fail to
communicate under high-load scenarios.
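
For an ~18-node cluster (the 15+ in the main partition plus the three rogues),
a sketch of what that would look like in elasticsearch.yml; the node count here
is an assumption, so plug in your real figure:

discovery.zen.minimum_master_nodes: 10   # (18 / 2) + 1
gateway.expected_nodes: 18               # total number of nodes expected at startup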

Cheers,

Ivan

On Mon, Nov 19, 2012 at 11:24 PM, wvl <william.leese@meltwater.com> wrote:

Currently 3 nodes don't want to join the cluster of 15+ nodes.

gateway.expected_nodes: 10
discovery.zen.minimum_master_nodes: 3

--

Right, we agree on the minimum_master_nodes.

Can't say why the cluster lost communication in the first place

So this is the issue that really concerns me - these unexplained node
failures.

--

Node failures typically occur when nodes don't respond to pings in a timely
manner. There are two primary reasons for ping failures: network issues
and the node being busy. A node can become unresponsive for a number of
reasons. For example, if you are not using the mlockall option, part of the
Elasticsearch process memory can get swapped out, and when GC kicks in on such
a node it can render the node unresponsive for a long time if there is not
enough physical memory on the system. So, memory would be the first thing to
check. What's the max heap size for your nodes, and how much memory do your
servers have? Another possible reason for unresponsive nodes is too much
traffic. Are you monitoring CPU and I/O load on your Elasticsearch nodes? How
were the nodes doing just before the failures occurred?
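
For reference, locking the heap in memory on a 0.20-era node is one setting in
elasticsearch.yml plus the heap size and ulimit for the process. The values
below are placeholders, not a recommendation for this cluster:

# elasticsearch.yml
bootstrap.mlockall: true        # keep the JVM heap from being swapped out

# environment for the startup script (adjust names to your packaging)
ES_HEAP_SIZE=8g                 # hypothetical value; commonly no more than ~half the physical RAM
ulimit -l unlimited             # needed so mlockall can actually lock the memory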

On Tuesday, November 20, 2012 9:36:36 PM UTC-5, wvl wrote:

Right, we agree on the minimum_master_nodes.

Can't say why the cluster lost communication in the first place

So this is the issue that really concerns me - these unexplained node
failures.

--

Hi Igor,

Thanks for joining the discussion.
We have all of these monitors in place and saw nothing alarming (I/O, RAM, or
CPU-wise), except a rise in the field_cache. We suspect there might be a
problem there, especially since we have had similar issues with the field cache
growing out of hand in the past. Regardless, this didn't cause any OOM
condition or any obvious (CPU, RAM, I/O) resource problem.

On Wednesday, November 21, 2012 11:50:48 PM UTC+9, Igor Motov wrote:

[...]

--

On Wednesday, November 21, 2012 1:12:28 AM UTC+1, wvl wrote:

If you haven't done it already, stop all indexing!

When the cluster hits status red it already stops indexing right?

No! Have a look at this nice demonstration from Lukas and Karel:
http://vimeo.com/44718093 (at 12:00)

Regards,
Peter.

--

I've started to use the following in Nagios to watch for load-induced split
brain. It queries a node, finds all the nodes in its cluster, and makes
sure they all have the same master_node.

My environment has node.name set to hostname for readability, so it may not
work out of the box if yours doesn't.
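
The check itself isn't shown above, but the idea goes roughly like this sketch
(not the poster's actual script): ask one seed node for the node list, then ask
every node which master it sees and compare. It assumes node.name matches a
resolvable hostname, HTTP on port 9200, and the 0.20-era filter_* parameters on
the cluster state API (the call also works without them, it just returns more
JSON):

#!/bin/sh
# Nagios-style split-brain check (sketch). Exit 0 = OK, 2 = CRITICAL.
SEED=${1:-localhost}
state() { curl -s "http://$1:9200/_cluster/state?filter_routing_table=true&filter_metadata=true"; }
expected=$(state "$SEED" | python -c 'import json,sys; print(json.load(sys.stdin)["master_node"])')
rc=0
for host in $(state "$SEED" | python -c 'import json,sys; print("\n".join(n["name"] for n in json.load(sys.stdin)["nodes"].values()))'); do
  seen=$(state "$host" | python -c 'import json,sys; print(json.load(sys.stdin)["master_node"])')
  if [ "$seen" != "$expected" ]; then
    echo "CRITICAL: $host thinks the master is $seen, seed says $expected"
    rc=2
  fi
done
[ $rc -eq 0 ] && echo "OK: all nodes agree on master $expected"
exit $rc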

On Tuesday, November 20, 2012 9:19:40 AM UTC-6, Drew Raines wrote:

We need to inspect the split brain situation more closely. You need
to ask each node who it thinks is master.

[...]

--