Multi data-center nodes going yellow


(Dusty Doris) #1

Right now I have two nodes in a cluster that are separated physically
between two data centers. We just set this up and are testing right now,
looking to make this work.

Problem:

My nodes are set up with unicast. Every couple of hours they stop talking
to each other and turn yellow. Restarting either instance fixes this and
turns them green again.

A poll to the status API shows:

{
  "cluster_name" : "itmSearch0",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5
}
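For readers skimming the numbers, here is a minimal Python sketch of how that response can be read. The function name `explain_health` and the `expected_nodes` parameter are made up for illustration; the field names are the ones in the response above. It relies on the fact that a yellow status means every primary is allocated but at least one replica is not.

```python
import json

# The health response from the post, reproduced as-is.
HEALTH_JSON = """
{
  "cluster_name" : "itmSearch0",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 5
}
"""


def explain_health(health, expected_nodes):
    """Return plain-language findings for one cluster health response.

    `expected_nodes` is a hypothetical parameter: how many nodes the
    operator believes should be in the cluster (2 in this thread).
    """
    findings = []
    if health["number_of_nodes"] < expected_nodes:
        findings.append(
            "only %d of %d nodes are in the cluster"
            % (health["number_of_nodes"], expected_nodes)
        )
    if health["unassigned_shards"] > 0:
        findings.append(
            "%d shard copies are unassigned" % health["unassigned_shards"]
        )
    if health["status"] == "yellow":
        # Yellow: all primaries are active, some replicas are not.
        findings.append("all primaries are active; the missing copies are replicas")
    return findings


if __name__ == "__main__":
    for line in explain_health(json.loads(HEALTH_JSON), expected_nodes=2):
        print(line)
```

Read this way, the response says the node answering the request sees only itself, so the 5 replicas have nowhere to go.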

The log files show this when they turn yellow. You can see I did a restart
at 11:54 and they went green. Then at 14:06, they went yellow.

Node1
[2012-07-03 11:54:33,742][INFO ][transport ] [Atom-Smasher] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/10.240.110.170:9300]}
[2012-07-03 11:54:37,218][INFO ][cluster.service ] [Atom-Smasher] detected_master [Freak of Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]], added {[Freak of Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]],}, reason: zen-disco-receive(from master [[Freak of Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]]])
[2012-07-03 11:54:37,227][INFO ][discovery ] [Atom-Smasher] itmSearch0/lcWP38JcTXK5UlQlFl-9bg
[2012-07-03 11:54:37,231][INFO ][http ] [Atom-Smasher] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/10.240.110.170:9200]}
[2012-07-03 11:54:37,231][INFO ][node ] [Atom-Smasher] {0.19.7}[26287]: started
[2012-07-03 14:06:51,192][INFO ][discovery.zen ] [Atom-Smasher] master_left [[Freak of Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]]], reason [do not exists on master, act as master failure]
[2012-07-03 14:06:51,193][INFO ][cluster.service ] [Atom-Smasher] master {new [Atom-Smasher][lcWP38JcTXK5UlQlFl-9bg][inet[/10.240.110.170:9300]], previous [Freak of Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]]}, removed {[Freak of Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]],}, reason: zen-disco-master_failed ([Freak of Science][9UmzsosBQJ6_NsM69Bka4Q][inet[/10.240.176.170:9300]])

Node2
[2012-07-03 11:54:29,811][INFO ][cluster.service ] [Freak of Science] removed {[Fin][T-R-mth1T8CmSfuyDIf0lQ][inet[/10.240.110.170:9300]],}, reason: zen-disco-node_left([Fin][T-R-mth1T8CmSfuyDIf0lQ][inet[/10.240.110.170:9300]])
[2012-07-03 11:54:37,178][INFO ][cluster.service ] [Freak of Science] added {[Atom-Smasher][lcWP38JcTXK5UlQlFl-9bg][inet[/10.240.110.170:9300]],}, reason: zen-disco-receive(join from node[[Atom-Smasher][lcWP38JcTXK5UlQlFl-9bg][inet[/10.240.110.170:9300]]])
[2012-07-03 14:06:50,751][INFO ][cluster.service ] [Freak of Science] removed {[Atom-Smasher][lcWP38JcTXK5UlQlFl-9bg][inet[/10.240.110.170:9300]],}, reason: zen-disco-node_failed([Atom-Smasher][lcWP38JcTXK5UlQlFl-9bg][inet[/10.240.110.170:9300]]), reason transport disconnected (with verified connect)

Here is my config file. This is the same on both.

cluster.name: itmSearch0
path.data: /data/elasticsearch/data
path.work: /data/elasticsearch/tmp
path.logs: /data/elasticsearch/logs
discovery.zen.ping.multicast.enabled: true
discovery.zen.ping.unicast.hosts: ["search1.cvg", "search1.phx"]
transport.tcp.port: 9300
http.port: 9200

Any ideas on what I can do to prevent them from turning yellow? While they
are in the yellow state I can still query the other server, so the
connectivity is still there. I originally had multicast.enabled set to
false, but tried setting it to true in case it made a difference (a shot
in the dark, since it doesn't seem like it should).

Configuration:

I have the default setup of 1 replica and 5 shards. I'm thinking that
this doesn't make sense for what I want. Should I set it up so that there
are no shards? Will that cause the data to basically be duplicated?
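As a rough illustration of what the 1 replica / 5 shards default means on one node versus two, here is a small Python sketch. The function `shard_copies` is hypothetical; it only encodes the rule that Elasticsearch does not place two copies of the same shard on one node.

```python
def shard_copies(primaries, replicas, nodes):
    """Count total, assigned and unassigned shard copies for one index.

    Hypothetical helper for illustration only.
    """
    # Each shard exists as 1 primary plus `replicas` replica copies.
    copies_per_shard = 1 + replicas
    # A node never holds two copies of the same shard, so at most
    # `nodes` copies of each shard can be assigned.
    assigned_per_shard = min(copies_per_shard, nodes)
    total = primaries * copies_per_shard
    assigned = primaries * assigned_per_shard
    return {"total": total, "assigned": assigned, "unassigned": total - assigned}


if __name__ == "__main__":
    print(shard_copies(primaries=5, replicas=1, nodes=1))
    print(shard_copies(primaries=5, replicas=1, nodes=2))
```

With one node, 5 of the 10 copies stay unassigned, which matches the yellow health output in this post. With two nodes all 10 are assigned, and each node ends up holding one copy of every shard, so the data is effectively duplicated across the two data centers.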

Also, I'm thinking about putting haproxy (or similar) in front of each
node, so the clients can simply use the TCPTransport connection. I would
set them up to hit their local node first and then fail over to the
remote node. That way, under ideal circumstances, they will not have to go
across the WAN.

Can this work? Or should I have a separate cluster in each data center
and somehow replicate the data between the two? Is there any current
documentation on this?

Thanks for your help!


(Dusty Doris) #2

I watched this fantastic video: http://vimeo.com/26710663# and I
understand the shard/replica concepts better now. I also see that my
question about the 1/5 replica/shard setup can be ignored.

I am still having issues with the two-node cluster going yellow
periodically. If anyone has tips on that, I'd appreciate it. Until
multi-datacenter awareness is added, it looks like I'll be stuck with my
clients needing to go over the connection between the two. Luckily we don't
seem to have problems with that. But I would like to keep those nodes
green if possible.

Thanks for any tips.


(gepo) #3

I am facing the same "problem" with how to set up ES across two data centers.

It would be interesting to know how you finally solved it!

kr
Georges


(Dusty Doris) #4

It turns out that it was a TCP timeout problem for me.

I added a file to /etc/sysctl.d/ that contained:

net.ipv4.tcp_keepalive_time = 1800
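To show what that setting changes, here is a small Python sketch of the keepalive timing arithmetic. It assumes the stock Linux values for the other two knobs (tcp_keepalive_intvl = 75, tcp_keepalive_probes = 9) and the stock tcp_keepalive_time of 7200; the post does not say what those were on these hosts, and the function name `keepalive_timeline` is made up for illustration.

```python
def keepalive_timeline(idle_time, interval, probes):
    """Return keepalive timings, in seconds, for one idle TCP connection.

    idle_time -> net.ipv4.tcp_keepalive_time
    interval  -> net.ipv4.tcp_keepalive_intvl  (assumed default: 75)
    probes    -> net.ipv4.tcp_keepalive_probes (assumed default: 9)
    """
    return {
        # Idle seconds before the kernel sends the first probe.
        "first_probe_after": idle_time,
        # Idle seconds before an unresponsive peer is declared dead.
        "dead_peer_declared_after": idle_time + interval * probes,
    }


# Assumed stock Linux settings, not values confirmed in the post.
LINUX_DEFAULTS = keepalive_timeline(7200, 75, 9)
# The value from the post, with the other two knobs left at the assumed defaults.
WITH_FIX = keepalive_timeline(1800, 75, 9)

if __name__ == "__main__":
    print("defaults:", LINUX_DEFAULTS)
    print("with fix:", WITH_FIX)
```

Under these assumptions the first probe goes out after 30 minutes of idle time instead of 2 hours, so anything between the data centers that silently drops idle connections (an assumption; the post only says "tcp timeout") sees traffic much sooner. The posted logs show the nodes joining at 11:54 and disconnecting at 14:06, roughly 2 hours 12 minutes apart, which is in the same range as the 7875 seconds the assumed defaults give, though the post does not confirm that this is the mechanism.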

