Master election problem in a 3-node cluster when one node died


(Emmett Hogan) #1

I have a pretty simple ELK stack here with 4 ES nodes (3 data nodes and 1 client node).

I lost one of the data nodes and all of a sudden my Logstash servers couldn't send anything to the ES cluster. I found this error on the LS nodes and on the surviving ES nodes:

{"error":
{"root_cause":
[{"type":"cluster_block_exception",
"reason":"blocked by: [SERVICE_UNAVAILABLE/2/no master];"}],
"type":"cluster_block_exception",
"reason":"blocked by: [SERVICE_UNAVAILABLE/2/no master];"},"status":503}",
:class=>"Elasticsearch::Transport::Transport::Errors::ServiceUnavailable", ....

I am assuming that this is because the ES cluster could no longer elect a master... but I thought that 2 surviving nodes out of a 3-node cluster were enough. Did I miss something along the way?

(Sorry if this is basic ES knowledge. I found a posting about someone else running into this problem as well during a rolling upgrade, but there was not a response.)

I am running ES version 2.2.0.

Thanks in advance.

-Emmett


(Christian Dahlqvist) #2

What is minimum_master_nodes set to?


(Emmett Hogan) #3

Uhhh....crap. I thought that it was automatically computed to (# nodes/2 + 1) if not defined...but now that I read the config...that's not exactly what it says. So...it's actually commented out in my config! Doh!

So...I should have:

discovery.zen.minimum_master_nodes: 2
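(For reference, the quorum rule the docs describe is integer `(master-eligible nodes / 2) + 1`. Here is a throwaway sketch of that arithmetic just to sanity-check the value — `quorum` is a local helper for illustration, not an ES command:)

```shell
# Quorum = (master-eligible nodes / 2) + 1, using integer division.
# "quorum" is just a local helper for illustration, not an ES command.
quorum() { echo $(( $1 / 2 + 1 )); }

quorum 3   # 3-node cluster -> 2 (cluster survives the loss of one node)
quorum 4   # 4-node cluster -> 3
```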

I am guessing that the right way to update this would be:

  • Change it on all three nodes
  • Shut down all my logstash nodes so nothing is getting sent to ES.
  • Shut down each ES node
  • Start each ES node

Otherwise, I'll run into the same problem as soon as I restart ES on a node to reload the new config.
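As an aside, in 2.x `minimum_master_nodes` is also a dynamic cluster setting, so one way to sidestep the restart-ordering problem might be to push it through the cluster settings API first and then update `elasticsearch.yml` at leisure (sketch only — the host is a placeholder for any live node in the cluster):

```shell
# Host is a placeholder; point it at any live node in your cluster.
# The persistent setting survives full cluster restarts.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "discovery.zen.minimum_master_nodes": 2
  }
}'
```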

Right?

Thanks for your help!

-Emmett


(Emmett Hogan) #4

While I am changing things in my config...should I also set:

gateway.recover_after_nodes: 2
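i.e., something along these lines in `elasticsearch.yml` (a sketch for a 3-data-node cluster — the values here are illustrative, not recommendations):

```yaml
# Sketch for a 3-data-node cluster; values are illustrative.
gateway.recover_after_nodes: 2    # don't start recovery until 2 nodes have joined
gateway.expected_nodes: 3         # ...or start immediately once all 3 are back
gateway.recover_after_time: 5m    # otherwise wait up to 5m before recovering
```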

-Emmett


(Emmett Hogan) #5

Answering my own question...I found this...

https://www.elastic.co/guide/en/elasticsearch/reference/current/restart-upgrade.html

-E


(Emmett Hogan) #6

I just noticed something strange though.

I followed the right procedure for restarting my cluster:

  1. Turned off shard reallocation
  2. Bounced the nodes
  3. Turned on shard reallocation

and everything looked fine...except that the last node that I brought up only has replicas on it. No primary shards at all!
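For the record, steps 1 and 3 above map to the cluster settings API along these lines (the host is a placeholder; `"none"` and `"all"` are the 2.x values for `cluster.routing.allocation.enable`):

```shell
# Placeholder host; run against any node. Step 1: stop shard allocation.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# ...restart the nodes one by one (step 2)...

# Step 3: re-enable allocation once the nodes have rejoined.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```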

-Emmett


(Mark Walkom) #7

That's nothing to be worried about :slight_smile: Primaries and replicas hold the same data; which copy happens to be primary just depends on which replicas were promoted while the node was down, and it doesn't affect data safety or searches.


(system) #8