Cluster nodes don't reconnect

Hi,
I have a setup of elasticsearch 0.90.0 with two nodes, each in a different
data center. From time to time the cluster status goes "yellow":
{
"cluster_name" : "my_cluster",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 10,
"active_shards" : 10,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 10
}

This is probably caused by short network freezes between the es nodes, because
a test that I run at 15s intervals occasionally times out:

nc -z -v -w 2 second_node 9300

Fri Jul 5 21:31:07 CEST 2013 Connection to second_node 9300 port [tcp/] succeeded!
Fri Jul 5 21:31:22 CEST 2013 nc: connect to second_node port 9300 (tcp) timed out: Operation now in progress
Fri Jul 5 21:31:39 CEST 2013 Connection to second_node 9300 port [tcp/] succeeded!
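
For anyone who wants to log the same kind of check without nc, here is a minimal sketch in Python. It only tests whether a TCP connection to the transport port can be opened within a timeout, which is what `nc -z -w 2` does. The function names `probe` and `monitor` are made up for this example.

```python
import socket
import time
from datetime import datetime


def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False


def monitor(host: str, port: int, interval: float = 15.0, rounds: int = 3) -> list:
    """Probe `rounds` times, sleeping `interval` seconds between attempts."""
    results = []
    for i in range(rounds):
        ok = probe(host, port)
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {host}:{port} {'succeeded' if ok else 'FAILED'}")
        results.append(ok)
        if i < rounds - 1:
            time.sleep(interval)
    return results
```

Calling `monitor("second_node", 9300)` would reproduce the 15s test above. A shorter interval makes it easier to see how long a freeze really lasts.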

What is strange to me is that the es nodes couldn't reconnect, and I get
these errors:
first_node: https://gist.github.com/planckiii/5947058
second_node: https://gist.github.com/planckiii/5947068

My Transport and Discovery configuration on both nodes:
################################## Transport ##################################
transport.tcp.connect.timeout: 5s
################################## Discovery ##################################
discovery.zen.ping_timeout: 5s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_interval: 2s
discovery.zen.fd.ping_retries: 10
discovery.zen.ping.multicast.enabled: false
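
A rough way to sanity-check these values is to compare them with the nc log. This sketch assumes (it is not confirmed anywhere in this thread) that fault detection drops a node only after `ping_retries` consecutive pings have each waited up to `ping_timeout`. The helper names are made up for the example.

```python
from datetime import datetime


def fd_tolerance_seconds(ping_timeout_s, ping_retries):
    """Assumed upper bound on how long an unresponsive peer is tolerated."""
    return ping_timeout_s * ping_retries


def max_outage_seconds(last_ok, next_ok):
    """Longest possible outage between two successful probes (HH:MM:SS, same day)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(next_ok, fmt) - datetime.strptime(last_ok, fmt)
    return delta.total_seconds()


# Values taken from the config and the nc log above.
tolerance = fd_tolerance_seconds(30, 10)
outage = max_outage_seconds("21:31:07", "21:31:39")
print(f"tolerance={tolerance}s, outage<={outage}s, within tolerance: {outage < tolerance}")
```

Under that assumption the freeze seen by nc is much shorter than what the ping settings should tolerate, so the ping timeouts on their own may not explain why the node was dropped.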

After resetting one node, everything is OK again and the cluster is properly balanced:
{
"cluster_name" : "my_cluster",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 10,
"active_shards" : 20,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

Any ideas what could be wrong with my setup?

Regards

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hey planckiii,

>> it's some kind of short network freeze between es nodes
Based on your statement above, I assume that the connection between the two
nodes is weak/unreliable.
In that case you should see if increasing the ping timeout and retries
values helps. Also, do you have any specific reason to disable multicast?

Also, if you have the head plugin installed on both nodes, then when this
happens again, could you bring up the head site pages of both nodes and
see whether they have both become master and are therefore unable to re-form
a single cluster? This would be a case of the split-brain problem.
To resolve this, you have two choices:

On Monday, July 8, 2013 17:06:56 UTC+2, Imdad Ahmed wrote:

> Hey planckiii,

Hi, thanks for the quick reply :)

> >> it's some kind of short network freeze between es nodes
> Based on your statement above, I assume that the connection between the two
> nodes is weak/unreliable.
> In that case you should see if increasing the ping timeout and retries
> values helps. Also, do you have any specific reason to disable multicast?

I tried with these:
discovery.zen.ping_timeout: 5s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_interval: 2s
discovery.zen.fd.ping_retries: 10
Network freezes shouldn't last longer than a few seconds, so in theory that
should be OK. About multicast: these are VMs behind an internal NAT, and
multicast can't work outside that NAT :(

> Also, if you have head plugin installed on both the nodes and when this
> happens again, could you bring up the head site pages of both the nodes and
> see if they are both becoming master and hence are not able to form back
> into one cluster. This would be a case of split-brain problem.
> To resolve this, you have two choices:

I agree that it's probably a split-brain problem after the disconnect, but
there isn't any info in the logs about that :( I will check that on the next
failure. Thanks for the advice. I will post an update on the status of this problem.
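
The check discussed here (whether each node believes in a different master) can also be done without the head plugin. The sketch below assumes that each node's HTTP API is reachable and that the cluster state response contains a top-level `master_node` field. The URLs in the usage note are placeholders, and the function names are made up for the example.

```python
import json
from urllib.request import urlopen


def fetch_master_id(base_url: str, timeout: float = 5.0) -> str:
    """Ask one node which node id it currently considers master."""
    with urlopen(f"{base_url}/_cluster/state", timeout=timeout) as resp:
        return json.load(resp)["master_node"]


def is_split_brain(master_by_node: dict) -> bool:
    """True if the queried nodes disagree about who the master is."""
    return len(set(master_by_node.values())) > 1
```

For example, `is_split_brain({url: fetch_master_id(url) for url in ("http://first_node:9200", "http://second_node:9200")})` returning True would mean that each node has elected a different master.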

Hey,

Running a cluster in a cross-data-center setup is generally not a good
idea. For example, if you are using replicas, every indexing operation goes
to both data centers and returns only when both have finished. This will
introduce high latency into your system. The same is true for searches that
go to several shards, which are shared across both data centers. If you can,
try to build a different sync mechanism instead of this kind of high-risk
setup (maybe writing data to both systems, each of which is an independent
cluster?).
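
As a very rough sketch of the "write to both systems" idea, the application can send each document to every cluster independently and record which writes failed. Everything here is an assumption made for illustration: the `write_to_all` name, the sender callables, and the idea that failed writes are retried later by some other mechanism.

```python
from typing import Callable, Dict

# A sender takes one document and indexes it into one cluster,
# raising an exception on failure (for example an HTTP client call).
Sender = Callable[[dict], None]


def write_to_all(doc: dict, senders: Dict[str, Sender]) -> Dict[str, str]:
    """Send `doc` to every cluster; return {cluster name: error} for failures."""
    failures = {}
    for name, send in senders.items():
        try:
            send(doc)
        except Exception as exc:
            # One unreachable data center must not block the other one.
            failures[name] = str(exc)
    return failures
```

Because each cluster is written to separately, a network freeze between the data centers only delays the writes to the remote cluster. Documents that end up in the returned failure map still have to be queued and replayed, otherwise the two clusters drift apart.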

--Alex
