Right now I have two nodes in a cluster that are physically separated
between two data centers. We just set this up and are testing right now,
looking to make this work.
Problem:
My nodes are set up with unicast, and they stop talking to each other every
couple of hours and turn yellow. A restart of either instance fixes this
and turns them green.
A poll of the cluster health API shows:
{
"cluster_name" : "itmSearch0",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 5,
"active_shards" : 5,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 5
}
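To spell out how I read that response (a quick sanity check I wrote, not anything authoritative): yellow seems to mean the primaries are all fine on the surviving node, but the 5 replica shards have no second node to live on.

```python
import json

# The health response captured above, while the cluster was yellow.
health = json.loads("""
{
  "cluster_name": "itmSearch0",
  "status": "yellow",
  "number_of_nodes": 1,
  "active_primary_shards": 5,
  "active_shards": 5,
  "unassigned_shards": 5
}
""")

# Yellow: every primary is active, but replicas are unassigned --
# consistent with a 5-shard / 1-replica index on a single remaining node.
all_primaries_up = health["active_shards"] >= health["active_primary_shards"]
replicas_stranded = health["unassigned_shards"] > 0 and health["number_of_nodes"] == 1

print(health["status"], all_primaries_up, replicas_stranded)
```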
The log files show this when they turn yellow. You can see I did a restart
at 11:54 and they went green. Then at 14:06, they went yellow.
Node1
[2012-07-03 11:54:33,742][INFO ][transport ] [Atom-Smasher]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address
{inet[/10.240.110.170:9300]}
[2012-07-03 11:54:37,218][INFO ][cluster.service ] [Atom-Smasher]
detected_master [Freak of
Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]], added {[Freak
of Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]],}, reason:
zen-disco-receive(from master [[Freak of
Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]]])
[2012-07-03 11:54:37,227][INFO ][discovery ] [Atom-Smasher]
itmSearch0/lcWP38JcTXK5UlQlFl-9bg
[2012-07-03 11:54:37,231][INFO ][http ] [Atom-Smasher]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address
{inet[/10.240.110.170:9200]}
[2012-07-03 11:54:37,231][INFO ][node ] [Atom-Smasher]
{0.19.7}[26287]: started
[2012-07-03 14:06:51,192][INFO ][discovery.zen ] [Atom-Smasher]
master_left [[Freak of
Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]]], reason [do
not exists on master, act as master failure]
[2012-07-03 14:06:51,193][INFO ][cluster.service ] [Atom-Smasher]
master {new
[Atom-Smasher][lcWP38JcTXK5UlQlFl-9bg][inet[/10.240.110.170:9300]],
previous [Freak of
Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]]}, removed
{[Freak of Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]],},
reason: zen-disco-master_failed ([Freak of
Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]])
Node2
[2012-07-03 11:54:29,811][INFO ][cluster.service ] [Freak of
Science] removed
{[Fin][T-R-mth1T8CmSfuyDIf0lQ][inet[/10.240.110.170:9300]],}, reason:
zen-disco-node_left([Fin][T-R-mth1T8CmSfuyDIf0lQ][inet[/10.240.110.170:9300]])
[2012-07-03 11:54:37,178][INFO ][cluster.service ] [Freak of
Science] added
{[Atom-Smasher][lcWP38JcTXK5UlQlFl-9bg][inet[/10.240.110.170:9300]],},
reason: zen-disco-receive(join from
node[[Atom-Smasher][lcWP38JcTXK5UlQlFl-9bg][inet[/10.240.110.170:9300]]])
[2012-07-03 14:06:50,751][INFO ][cluster.service ] [Freak of
Science] removed
{[Atom-Smasher][lcWP38JcTXK5UlQlFl-9bg][inet[/10.240.110.170:9300]],},
reason:
zen-disco-node_failed([Atom-Smasher][lcWP38JcTXK5UlQlFl-9bg][inet[/10.240.110.170:9300]]),
reason transport disconnected (with verified connect)
Here is my config file. It is the same on both nodes.
cluster.name: itmSearch0
path.data: /data/elasticsearch/data
path.work: /data/elasticsearch/tmp
path.logs: /data/elasticsearch/logs
discovery.zen.ping.multicast.enabled: true
discovery.zen.ping.unicast.hosts: ["search1.cvg", "search1.phx"]
transport.tcp.port: 9300
http.port: 9200
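For completeness, here is the variant I have been considering trying next, with multicast disabled again. The fd.* fault-detection and keep_alive lines are guesses on my part (loosened for WAN latency and for firewalls that drop long-idle TCP sessions between data centers), not something I have verified:

```yaml
cluster.name: itmSearch0

# Unicast-only discovery: disable multicast when hosts are listed explicitly.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["search1.cvg", "search1.phx"]

# Loosen zen fault detection for WAN latency (guessed values).
discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_retries: 5

# Ask the OS to keep idle inter-node connections alive.
network.tcp.keep_alive: true
```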
Any ideas on what I can do to prevent them from turning yellow? When they
are in the yellow state I can still query the other server, so the
connectivity is there. I originally had multicast.enabled set to false,
but tried setting it to true just in case it made a difference (a shot in
the dark, as it doesn't seem like it should).
Configuration:
I have the default setup of 5 shards and 1 replica. I'm thinking that this
doesn't make sense for what I want. Should I set it up with a single shard
instead? Will that cause the data to basically be duplicated?
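To frame the question: my understanding (unverified) is that with 2 nodes I could leave the 5 shards alone and just keep number_of_replicas at 1, which already puts a copy of every document on both nodes. Something like this, where the index name is hypothetical:

```shell
# Sketch only: one replica per shard, so each node holds a full copy.
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '
{
  "index": { "number_of_replicas": 1 }
}'
```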
Also, I'm thinking about putting haproxy (or similar) in front of each
node, so the clients can simply use the TCPTransport connection, and I will
set them up to hit the respective local node first and then fail over to
the remote node. That way, under ideal circumstances, they will not have
to go across the WAN.
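Something like this haproxy stanza is what I had in mind (a sketch for the CVG side, using the addresses from the logs; the backup keyword is my guess at how to express local-first failover in tcp mode):

```
# On the CVG node: local transport port is primary, PHX is standby.
listen es_transport 0.0.0.0:9301
    mode tcp
    server local  10.240.110.170:9300 check
    server remote 10.240.176.170:9300 check backup
```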
Can this work? Or should I have two separate clusters, one in each data
center, and somehow replicate the data between the two? Is there any
current documentation on this?
Thanks for your help!