Is it possible to get the tribe cluster/master to retry the initial connection to a child node/cluster, even after having established a connection to the rest of the tribe?
If the child node/cluster is offline when the tribe cluster starts, it logs this:
[2015-08-11 14:01:34,114][WARN ][discovery ] [devcluster-tribe/t2] waited for 30s and no initial state was set by the discovery
... then gives up (continuing with the other child nodes/clusters) and never tries to reconnect.
This makes the startup sequence of a tribe solution critical, as it requires that all sub-clusters are up and available when the tribe cluster starts.
We are hoping to tie queries together across 12 regional data centers by using tribe nodes, but this makes maintenance harder, as it will require a restart of the tribe clusters if any sub-cluster was unavailable at startup.
I am currently experimenting with tribe nodes and I do not think this is the case. When I start up a remote cluster after the tribe node is already running, it does try to discover it (although the attempt is not reflected well in the logs on the tribe node). When the remote cluster is started up, I see this in its logs after recovery is completed:
[2015-08-27 10:16:41,051][INFO ][http ] [Ithil] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/...:9200]}
[2015-08-27 10:16:41,064][INFO ][node ] [Ithil] started
[2015-08-27 10:16:41,062][INFO ][gateway ] [Ithil] recovered [1] indices into cluster_state
[2015-08-27 10:16:41,588][INFO ][watcher ] [Ithil] watch service has started
[2015-08-27 10:16:48,103][INFO ][cluster.service ] [Ithil] added {[Tribemaster/t2][DuNkt5xiQWiIvk99pkySGg][Osgiliath][inet[/...:9303]]{data=false, client=true},}, reason: zen-disco-receive(join from node[[Tribemaster/t2][DuNkt5xiQWiIvk99pkySGg][Osgiliath][inet[/...:9303]]{data=false, client=true}])
That is how I have it set up. Multicast is turned off on every node (it wouldn't work on our network anyway). On the tribe node I have the tribes set up like this:
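The actual config did not make it into the thread, so for context, here is a minimal sketch of what an Elasticsearch 1.x tribe-node `elasticsearch.yml` with multicast disabled and unicast discovery typically looks like. The cluster names, IPs, and ports below are placeholders, not the poster's real values:

```yaml
# elasticsearch.yml on the tribe node -- illustrative sketch only;
# names, addresses, and ports are hypothetical
cluster.name: my-tribe

tribe:
  t1:
    cluster.name: child-cluster-1
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["10.0.1.10:9300"]
  t2:
    cluster.name: child-cluster-2
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["10.0.2.10:9300"]
```

Each `tribe.*` block spawns an internal client node that joins the named child cluster, which is why the logs above show node names like `Tribemaster/t2`.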
The two clusters are located on different continents. I am having a different issue myself, but that is unrelated to this thread.
EDIT1: One thing I forgot to mention: I use IP addresses in every config file, not DNS names, but that should not influence the discovery behavior.
EDIT2: I also did a Wireshark capture and I can confirm that the tribe node keeps sending TCP keepalives even when it knows the node is unreachable. That is why it can rejoin as soon as the cluster state goes back to normal.
Interesting.
One thing: are you sure you wait for the tribe master to give up initializing before you start the child node?
This is how it logs if I stop both child clusters before starting the tribe node.
It stays this way even after starting either of the child clusters following the last "started":
[2015-08-27 23:10:19,811][INFO ][node ] [maeaint02-tribe] initialized
[2015-08-27 23:10:19,811][INFO ][node ] [maeaint02-tribe] starting ...
[2015-08-27 23:10:20,061][INFO ][transport ] [maeaint02-tribe] bound_address {inet[/0:0:0:0:0:0:0:0:9303]}, publish_address {inet[/x.x.x.x:9303]}
[2015-08-27 23:10:20,076][INFO ][discovery ] [maeaint02-tribe] devbridge-tribe/_VfZiOOBQear7ze5Vzku8w
[2015-08-27 23:10:20,076][WARN ][discovery ] [maeaint02-tribe] waited for 0s and no initial state was set by the discovery
[2015-08-27 23:10:20,154][INFO ][http ] [maeaint02-tribe] bound_address {inet[/0:0:0:0:0:0:0:0:9203]}, publish_address {inet[/x.x.x.x:9203]}
[2015-08-27 23:10:20,154][INFO ][node ] [maeaint02-tribe/t2] starting ...
[2015-08-27 23:10:20,310][INFO ][transport ] [maeaint02-tribe/t2] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/x.x.x.x:9301]}
[2015-08-27 23:10:20,435][INFO ][discovery ] [maeaint02-tribe/t2] tribe-test2/Y4yaAZOLThO7GyWmo3Zhsw
[2015-08-27 23:10:50,458][WARN ][discovery ] [maeaint02-tribe/t2] waited for 30s and no initial state was set by the discovery
[2015-08-27 23:10:50,458][INFO ][node ] [maeaint02-tribe/t2] started
[2015-08-27 23:10:50,458][INFO ][node ] [maeaint02-tribe/t1] starting ...
[2015-08-27 23:10:50,571][INFO ][transport ] [maeaint02-tribe/t1] bound_address {inet[/0:0:0:0:0:0:0:0:9302]}, publish_address {inet[/x.x.x.x:9302]}
[2015-08-27 23:10:50,680][INFO ][discovery ] [maeaint02-tribe/t1] tribe-test1/8vV9LSuzS76Ob54dTS-5kg
[2015-08-27 23:11:20,691][WARN ][discovery ] [maeaint02-tribe/t1] waited for 30s and no initial state was set by the discovery
[2015-08-27 23:11:20,691][INFO ][node ] [maeaint02-tribe/t1] started
[2015-08-27 23:11:20,691][INFO ][node ] [maeaint02-tribe] started
I actually managed to get it to work.
I had not noticed that the tribe clients themselves bind to local ports as well. Since I was testing with multiple nodes on the same machine, there were situations where my test nodes did not get the expected port numbers if the tribe node started first. (My initial tests were between data centers, though, so I don't know why I couldn't get it to work then.)
So thanks for your comments; they made me try again, this time making sure the nodes had explicit port bindings.
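For anyone hitting the same thing: when several nodes share one machine, each node (and each internal tribe client) grabs the next free port in the 9300 range, so startup order decides who gets which port. Pinning the ports explicitly avoids that. A sketch, with hypothetical port numbers:

```yaml
# elasticsearch.yml on a child-cluster node sharing the machine --
# pin its transport port so it no longer depends on startup order
transport.tcp.port: 9300
http.port: 9200

# On the tribe node, each tribe client can likewise be pinned,
# e.g. (setting names per ES 1.x node settings; ports hypothetical):
# tribe.t1.transport.tcp.port: 9301
# tribe.t2.transport.tcp.port: 9302
```

With fixed ports, the unicast host lists in the other nodes' configs always point at the right process regardless of which node started first.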