Node does not join the cluster

I am trying to implement failover in a cluster with two nodes. When I test
the configuration with both nodes (two Tomcat servers, each running a web
app that starts an embedded ES server) running on the same machine, they
reliably work. The two nodes cluster together and the cluster status is
green.

{
  "cluster_name": "elasticsearch",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 2,
  "number_of_data_nodes": 2,
  "active_primary_shards": 5,
  "active_shards": 10,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0
}
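
For reference, each web app starts its embedded node roughly like the
sketch below. This is a simplified illustration rather than my exact code;
the class name and the specific settings are assumptions, and no explicit
discovery settings are shown, so the node would fall back to the default
multicast zen discovery.

// Simplified sketch of how each web app starts its embedded ES node.
// The class name and settings are illustrative, not my exact configuration.
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.node.Node;
import org.elasticsearch.node.NodeBuilder;

public class EmbeddedEsStarter {
    public static Node startNode() {
        ImmutableSettings.Builder settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "elasticsearch") // default cluster name, same on both nodes
                .put("node.data", true);              // each embedded node holds data
        return NodeBuilder.nodeBuilder()
                .settings(settings)
                .client(false)   // full data node, not a client-only node
                .node();         // builds and starts the node
    }
}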

Starting the same two servers on two different machines sometimes results
in a green cluster, but more often than not, the two start as two separate
instances, both with status yellow.

During a successful startup attempt, when the second server of the cluster
starts up there is a lengthy handshake (I enabled Java debug-level logging
for the org.elasticsearch package):

2012-10-18 14:37:15,200 EDT [elasticsearch[Killmonger, Erik][generic][T#9]] DEBUG[org.elasticsearch.transport.netty]: [Killmonger, Erik] connected to node [[Anomalito][XqHrYxdEQIagjLGVWokCwQ][inet[/192.168.35.137:9300]]]

2012-10-18 14:37:21,922 EDT [elasticsearch[Killmonger, Erik][clusterService#updateTask][T#1]] DEBUG[org.elasticsearch.cluster.service]: [Killmonger, Erik] processing [zen-disco-receive(join from node[[Anomalito][XqHrYxdEQIagjLGVWokCwQ][inet[/192.168.35.137:9300]]])]: execute

2012-10-18 14:37:21,976 EDT [elasticsearch[Killmonger, Erik][clusterService#updateTask][T#1]] DEBUG[org.elasticsearch.cluster.service]: [Killmonger, Erik] cluster state updated, version [8], source [zen-disco-receive(join from node[[Anomalito][XqHrYxdEQIagjLGVWokCwQ][inet[/192.168.35.137:9300]]])]

2012-10-18 14:37:21,977 EDT [elasticsearch[Killmonger, Erik][clusterService#updateTask][T#1]] INFO [org.elasticsearch.cluster.service]: [Killmonger, Erik] added {[Anomalito][XqHrYxdEQIagjLGVWokCwQ][inet[/192.168.35.137:9300]],}, reason: zen-disco-receive(join from node[[Anomalito][XqHrYxdEQIagjLGVWokCwQ][inet[/192.168.35.137:9300]]])

2012-10-18 14:37:22,043 EDT [elasticsearch[Killmonger, Erik][clusterService#updateTask][T#1]] DEBUG[org.elasticsearch.cluster.service]: [Killmonger, Erik] processing [zen-disco-receive(join from node[[Anomalito][XqHrYxdEQIagjLGVWokCwQ][inet[/192.168.35.137:9300]]])]: done applying updated cluster_state

2012-10-18 14:37:22,043 EDT [elasticsearch[Killmonger, Erik][riverClusterService#updateTask][T#1]] DEBUG[org.elasticsearch.river.cluster]: [Killmonger, Erik] processing [reroute_rivers_node_changed]: execute

2012-10-18 14:37:22,044 EDT [elasticsearch[Killmonger, Erik][riverClusterService#updateTask][T#1]] DEBUG[org.elasticsearch.river.cluster]: [Killmonger, Erik] processing [reroute_rivers_node_changed]: no change in cluster_state

2012-10-18 14:37:24,902 EDT [elasticsearch[Killmonger, Erik][clusterService#updateTask][T#1]] DEBUG[org.elasticsearch.cluster.service]: [Killmonger, Erik] processing [routing-table-updater]: execute

2012-10-18 14:37:25,122 EDT [elasticsearch[Killmonger, Erik][clusterService#updateTask][T#1]] DEBUG[org.elasticsearch.cluster.service]: [Killmonger, Erik] cluster state updated, version [9], source [routing-table-updater]

2012-10-18 14:37:25,150 EDT [elasticsearch[Killmonger, Erik][riverClusterService#updateTask][T#1]] DEBUG[org.elasticsearch.river.cluster]: [Killmonger, Erik] processing [reroute_rivers_node_changed]: execute

2012-10-18 14:37:25,150 EDT [elasticsearch[Killmonger, Erik][riverClusterService#updateTask][T#1]] DEBUG[org.elasticsearch.river.cluster]: [Killmonger, Erik] processing [reroute_rivers_node_changed]: no change in cluster_state

2012-10-18 14:37:25,179 EDT [elasticsearch[Killmonger, Erik][clusterService#updateTask][T#1]] DEBUG[org.elasticsearch.cluster.service]: [Killmonger, Erik] processing [routing-table-updater]: done applying updated cluster_state

2012-10-18 14:37:32,667 EDT [elasticsearch[Killmonger, Erik][clusterService#updateTask][T#1]] DEBUG[org.elasticsearch.cluster.service]: [Killmonger, Erik] processing [create-index [marketplace], cause [api]]: execute

...

When I shut down one of the servers and then restart it, more often than
not the Tomcat server starts up successfully, but the second ES server does
not form a cluster with the running node. The two ES servers have yellow
status (since the replication factor is 1), but otherwise appear
functional. No error messages are displayed. Instead of the lengthy stream
of messages seen during a successful startup, only a single message is
emitted by the master:

2012-10-18 15:11:36,268 EDT [elasticsearch[Candra][generic][T#11]] DEBUG[org.elasticsearch.transport.netty]: [Candra] connected to node [[Jester][D9vXW5OrSUSrwuDaR0SIVA][inet[/192.168.35.137:9300]]]

This message seems to imply that the two servers connected, but they do not
form a cluster.

Connecting to each server in the cluster gives me this status:

{
  "cluster_name": "elasticsearch",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 5,
  "active_shards": 5,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 5
}

Both Candra and Jester run, but as two standalone ES servers.

What could be the cause of this occasional failure to connect? How do I
debug it?

Thank you,

Michael

--

I noticed that in the case of a faulty startup, the lone message about the
server connection is followed by these messages from Tomcat:

2012-10-18 17:32:30,813 EDT [elasticsearch[Berzerker][generic][T#7]] DEBUG[org.elasticsearch.transport.netty]: [Berzerker] connected to node [[Dreadknight][fqLtglKzTJ6uit2aYA_YAw][inet[/192.168.35.137:9300]]]

2012-10-18 17:32:54,160 [Membership-MemberExpired.] INFO org.apache.catalina.tribes.group.interceptors.TcpFailureDetector- Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{127, 0, 0, 1}:4001,{127, 0, 0, 1},4001, alive=95012, securePort=-1, UDP Port=-1, id={-4 33 -44 -69 -27 -103 68 -21 -88 -63 -92 120 -64 -55 -100 -36 }, payload={}, command={}, domain={}, ]] message. Will verify.

2012-10-18 17:32:54,162 [Membership-MemberExpired.] INFO org.apache.catalina.tribes.group.interceptors.TcpFailureDetector- Verification complete. Member still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{127, 0, 0, 1}:4001,{127, 0, 0, 1},4001, alive=95012, securePort=-1, UDP Port=-1, id={-4 33 -44 -69 -27 -103 68 -21 -88 -63 -92 120 -64 -55 -100 -36 }, payload={}, command={}, domain={}, ]]

2012-10-18 17:32:57,675 [Membership-MemberExpired.] INFO org.apache.catalina.tribes.group.interceptors.TcpFailureDetector- Received memberDisappeared[org.apache.catalina.tribes.membership.MemberImpl[tcp://{127, 0, 0, 1}:4001,{127, 0, 0, 1},4001, alive=98395, securePort=-1, UDP Port=-1, id={-4 33 -44 -69 -27 -103 68 -21 -88 -63 -92 120 -64 -55 -100 -36 }, payload={}, command={}, domain={}, ]] message. Will verify.

2012-10-18 17:32:57,680 [Membership-MemberExpired.] INFO org.apache.catalina.tribes.group.interceptors.TcpFailureDetector- Verification complete. Member still alive[org.apache.catalina.tribes.membership.MemberImpl[tcp://{127, 0, 0, 1}:4001,{127, 0, 0, 1},4001, alive=98395, securePort=-1, UDP Port=-1, id={-4 33 -44 -69 -27 -103 68 -21 -88 -63 -92 120 -64 -55 -100 -36 }, payload={}, command={}, domain={}, ]]

I wonder whether there is a relationship between the memberDisappeared
messages from Tomcat and the inability of the Elasticsearch nodes to form a
cluster.

Michael

--

I am not really sure, but you can turn on discovery logging (the org.elasticsearch.discovery package if running embedded) to see why it fails to discover the other node.
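
Since you run the node embedded, one way to do that with log4j is sketched
below (the helper class name is made up, and this assumes log4j 1.x is the
web app's logging backend; you can equally set the level in your log4j
configuration file):

// Sketch: raise the Elasticsearch discovery logger to TRACE at runtime.
// Assumes log4j 1.x on the classpath; the class name is hypothetical.
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class DiscoveryLoggingSetup {
    public static void enableDiscoveryTrace() {
        // When embedded, ES loggers use the full package name,
        // so this covers the zen discovery and ping classes.
        Logger.getLogger("org.elasticsearch.discovery").setLevel(Level.TRACE);
    }
}

Call this before (or right after) starting the embedded node and the
discovery/ping attempts should show up in the logs.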

--