Node failing to join the cluster after reboot

So I've been working with an Elasticsearch cluster for a couple of months now.
I'm finally getting it into production, but after a soft launch I realized
that I needed to allocate more RAM to each instance. I'm running 3 boxes
with 3 instances of Elasticsearch each. I took the first box down, added
the RAM, and brought it back up into the cluster; all was well. Moving to
the second box, I now have a routing/load-balancing node that won't come back
into the cluster. The other 2 instances rejoined fine. I have tried restarting
the failed instance several times with no luck; I keep getting a "no masterNode
returned" error.

Relevant info

3 machines, 3 instances each
Instances 1 + 2: data + master-eligible
Instance 3: no data, not master-eligible (load balancing/routing only)

OS: CentOS 6.4
Elasticsearch version: 0.90.3 (I know this is somewhat dated now, but we must
extensively test new releases in dev/test before moving to production)

IPTables:

# Elasticsearch REST API (HTTP)
-A INPUT -m state --state NEW -m tcp -p tcp --dport 9200 -j ACCEPT
# Elasticsearch transport service
-A INPUT -m state --state NEW -m tcp -p tcp --dport 9300 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 9301 -j ACCEPT
-A INPUT -m state --state NEW -m tcp -p tcp --dport 9302 -j ACCEPT

# Allow multicast for Elasticsearch auto-discovery
-A INPUT -m pkttype --pkt-type multicast -j ACCEPT

Trace discovery logs:
https://gist.github.com/jumpinjoeadams/7008972

Relevant ES config:
cluster.name: NightRunnerProd
http.enabled: false (on instances 1 + 2 only)
gateway.recover_after_nodes: 4
gateway.recover_after_time: 20s
gateway.expected_nodes: 6 (these recovery options were lowered previously to
resolve this issue, but apparently that only prolonged it)
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.11.253.173[9300-9305]",
"10.11.253.174[9300-9305]", "10.11.253.175[9300-9305]"]
node.master: false (instance 3 only)
node.data: false (instance 3 only)
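
For reference, here is roughly how those settings combine in instance 3's
elasticsearch.yml (the load-balancing node that won't rejoin). The values are
the ones quoted above; anything else is an assumption on my part:

# Consolidated sketch for the routing/load-balancing instance (assumed layout)
cluster.name: NightRunnerProd
node.master: false      # not master-eligible
node.data: false        # holds no shards, routes requests only
http.enabled: true      # HTTP left on here (disabled on instances 1 + 2)
gateway.recover_after_nodes: 4
gateway.recover_after_time: 20s
gateway.expected_nodes: 6
discovery.zen.minimum_master_nodes: 2
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.11.253.173[9300-9305]", "10.11.253.174[9300-9305]", "10.11.253.175[9300-9305]"]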

I have tried disabling iptables entirely.
SELinux shows no errors.
Google has been no help.
Every time the node comes up it doesn't join the cluster; it just returns a
503. The failed node is the load-balancing node on machine 2.
I'm going nuts trying to figure out why this happens from time to time.

Thanks in advance!
Joe


It has been a week and I have still had no luck. Does anyone have any ideas?
I'm at a loss.


Not sure what is going on. Which node is now master?

gateway.expected_nodes: 6 means the cluster waits for 6 nodes, but as far as I
know you have 3 nodes. Is that correct?

A node that is not master-eligible can complicate things (for a simple mind
like me). discovery.zen.minimum_master_nodes: 2 can be quite optimistic. If
you have three nodes and one of them is not master-eligible, I would have
chosen discovery.zen.minimum_master_nodes: 3 so that all nodes must be up
before a master is elected.

Just my 2p

Jörg


The master is NightRunner103.Instance2. There are 9 nodes in total: 6 data
nodes and 3 non-data nodes. I will change my minimum master nodes setting and
see if that helps.


If you have 9 nodes, you should take care to set minimum master nodes to at
least 5 (half the cluster nodes plus one).
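
To spell out that arithmetic (assuming, as above, that all 9 nodes count
toward the quorum):

# quorum = floor(total_nodes / 2) + 1
# floor(9 / 2) + 1 = 4 + 1 = 5
discovery.zen.minimum_master_nodes: 5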

Jörg


The 3 non-data nodes are not master-eligible, though; they are set up purely
for load balancing. Is there a downside to making these nodes master-eligible?
That would give me an odd number of eligible masters and help prevent
split-brain.


Is there any downside to having nodes that serve only one role: master, data,
or client?
For example, say I have a cluster with 3 dedicated masters, 5 data nodes, and
2 client nodes.
The 5 data nodes and 2 client nodes are not master-eligible, and the 2 client
nodes are only meant for load balancing client requests.
In this case, I would set minimum master nodes to 2.

Do you see any problem with this configuration?
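
In elasticsearch.yml terms, the three roles in that example would look roughly
like this (just a sketch of the role flags, not taken from any real config in
this thread):

# Dedicated master-eligible node (no data, no client traffic)
node.master: true
node.data: false

# Dedicated data node (holds shards, never elected master)
node.master: false
node.data: true

# Client / load-balancing node (routes requests only)
node.master: false
node.data: false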



I just modified my settings so the cluster expects 9 nodes, recovers after 6,
minimum master nodes is 5, and every node is master-eligible. After some
testing, I have the same problem. I shut down 3 nodes at a time and everything
came back up fine. Then I tested a larger failure where I shut down 6 of the 9
nodes, and when I brought them all back up, 2 of them would not rejoin. The
cluster is green because all shards and replicas are OK again, but I still
have 2 missing nodes that won't join.
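
For reference, the revised settings look roughly like this (the values are as
described above; treat the exact layout as an assumption):

gateway.expected_nodes: 9
gateway.recover_after_nodes: 6
discovery.zen.minimum_master_nodes: 5
node.master: true    # now set on every instance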


It looks like, after some time and another restart of the failed nodes, they
were finally able to rejoin the cluster.
