New Master not found

Hey guys,

Came in this morning to find our cluster returning 503s. I've been going
over the logs to find out what happened, and there's something funny
going on that I'd like some input on. Here's a bit of a timeline to give
you context (our nodes are called ROUTER0X and DATA0X):

  • 7:21, Router00 elasticsearch service is restarted (we don't know why
    these restarts happened yet)
  • 7:21, Router00 is removed from cluster
  • 7:21, Router00 elasticsearch service starts back up again and sees
    Router01 as the master
  • 7:21, Data03 elasticsearch service is restarted
  • 7:21, Data03 is removed from cluster
  • 7:21, Data03 elasticsearch service starts back up again and sees
    Router01 as the master
  • 7:21, Data00 elasticsearch service is restarted
  • 7:21, Data00 is removed from cluster
  • 7:21, Data00 elasticsearch service starts back up again and sees
    Router01 as the master
  • 7:21, Data01 elasticsearch service is restarted
  • 7:21, Data01 is removed from cluster
  • 7:21, Data01 elasticsearch service starts back up again and sees
    Router01 as the master
  • 7:25, Data02 elasticsearch service is restarted
  • 7:25, Router01 elasticsearch service is restarted
  • 7:25, Data03 is selected as new master

At this point we get into the funny state: Data02 thinks Router01 is the
master and fails to connect. Router00 sees Data03 as the master, but almost
immediately afterwards it sees Router01 as the master again. Eventually
all nodes start trying to connect to Router01 as the master but fail. This
continues for an hour until we figure it out and restart the whole cluster.
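
One way to confirm this kind of disagreement while it is happening is to ask
each node which master it currently believes in and compare the answers. The
sketch below is illustrative only. It assumes each node's HTTP port is
reachable, that the cluster state API accepts local=true to return the node's
own view rather than the master's, and that the response has a top-level
"master_node" field; please check these against your Elasticsearch version.
The function names and host names are hypothetical.

```python
import json
from collections import defaultdict
from urllib.request import urlopen


def fetch_master_view(host, port=9200, timeout=5):
    """Return the master node id that `host` currently reports, or None.

    Assumption: GET /_cluster/state?local=true returns this node's own view
    with a top-level "master_node" field.
    """
    url = "http://%s:%d/_cluster/state?local=true" % (host, port)
    try:
        with urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("master_node")
    except (OSError, ValueError):
        # Unreachable node or unparsable response: treat as "no view".
        return None


def group_by_master(views):
    """Group node names by the master they report.

    `views` maps node name -> reported master (None if unknown).
    """
    groups = defaultdict(list)
    for node, master in sorted(views.items()):
        groups[master].append(node)
    return dict(groups)


def is_split(views):
    """True if nodes disagree about the master, or any node has no view."""
    masters = set(views.values())
    return len(masters) > 1 or None in masters
```

With made-up views such as {"router00": "router01", "data03": "data03"},
is_split returns True and group_by_master shows which nodes are following
which master. To gather real views, call fetch_master_view once per node.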

This is the log from router00 from 7:25: https://gist.github.com/getsometoast/5705025
This is the log for Data03 for the same period: https://gist.github.com/getsometoast/5705034
And for router01: https://gist.github.com/getsometoast/5705039

The other nodes have a similar loop of failing to send the join request to
the master. BTW this is a 6-node cluster with minimum_master_nodes set to 4.
It almost looks like the nodes have a cached connection to the old master
that never gets flushed after a new master is chosen...
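
For reference, the arithmetic behind that setting: the usual recommendation is
a strict majority of master-eligible nodes, which for 6 nodes is 4. A minimal
sketch, assuming all six nodes are master-eligible (not verified here) and
using hypothetical function names:

```python
def majority_quorum(master_eligible_nodes):
    """Smallest node count that is a strict majority."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def can_elect_master(visible_master_eligible, minimum_master_nodes):
    """An election can only succeed if enough master-eligible nodes see each other."""
    return visible_master_eligible >= minimum_master_nodes
```

Under these assumptions majority_quorum(6) gives 4, and can_elect_master(3, 4)
is False, so with three or more nodes restarting at the same moment no master
can be elected until enough of them have rejoined.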

Any insights greatly appreciated.

Regards,
James
