Cluster nodes don't reconnect

planckiii · July 8, 2013, 8:31am

Hi,
I have an Elasticsearch 0.90.0 setup with two nodes, each in a different
data center. From time to time the cluster status goes "yellow":
{
"cluster_name" : "my_cluster",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 1,
"number_of_data_nodes" : 1,
"active_primary_shards" : 10,
"active_shards" : 10,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 10
}
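
One way to read the health output above, as a minimal sketch: the number of assigned replicas is `active_shards` minus `active_primary_shards`. The helper name `replica_summary` is hypothetical; the field names and values are the ones from the post.

```python
import json


def replica_summary(health: dict) -> dict:
    """Summarise shard assignment from a cluster health response."""
    primaries = health["active_primary_shards"]
    # Every active shard that is not a primary is an assigned replica.
    replicas_assigned = health["active_shards"] - primaries
    return {
        "nodes": health["number_of_nodes"],
        "primaries": primaries,
        "replicas_assigned": replicas_assigned,
        "unassigned": health["unassigned_shards"],
    }


# The "yellow" response quoted in the post.
YELLOW = json.loads("""
{
  "cluster_name": "my_cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 1,
  "number_of_data_nodes": 1,
  "active_primary_shards": 10,
  "active_shards": 10,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 10
}
""")

summary = replica_summary(YELLOW)
print(summary)
```

For the yellow state this gives 10 primaries, 0 assigned replicas and 10 unassigned shards on a single node, which is consistent with the second node having left the cluster and its replicas having nowhere to go.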

It is probably some kind of short network freeze between the ES nodes, because
a test that I run every 15 seconds occasionally times out:

nc -z -v -w 2 second_node 9300

Fri Jul 5 21:31:07 CEST 2013 Connection to second_node 9300 port [tcp/] succeeded!
Fri Jul 5 21:31:22 CEST 2013 nc: connect to second_node port 9300 (tcp) timed out: Operation now in progress
Fri Jul 5 21:31:39 CEST 2013 Connection to second_node 9300 port [tcp/] succeeded!
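
The nc test above could also be scripted, for example to keep the results in one place. This is only a sketch of the same idea in Python: the function names `probe` and `monitor` are hypothetical, and the 2-second timeout and 15-second interval are the values from the post.

```python
import socket
import time
from datetime import datetime


def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name resolution failures.
        return False


def monitor(host: str, port: int, interval: float = 15.0, rounds: int = 3) -> list:
    """Probe host:port `rounds` times, sleeping `interval` seconds between probes."""
    results = []
    for i in range(rounds):
        ok = probe(host, port)
        stamp = datetime.now().isoformat(timespec="seconds")
        print(f"{stamp} {host}:{port} {'succeeded' if ok else 'FAILED'}")
        results.append(ok)
        if i < rounds - 1:
            time.sleep(interval)
    return results


# Example call with the host name and port from the post:
# monitor("second_node", 9300)
```

Like nc, this only checks that the transport port accepts a TCP connection; it says nothing about whether the Elasticsearch nodes still consider each other part of the cluster.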

What is strange to me is that the ES nodes could not reconnect, and I get
errors like these:
first_node: https://gist.github.com/planckiii/5947058
second_node: https://gist.github.com/planckiii/5947068

My Transport and Discovery configuration on both nodes:
################################## Transport ##################################
transport.tcp.connect.timeout: 5s
################################## Discovery ##################################
discovery.zen.ping_timeout: 5s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_interval: 2s
discovery.zen.fd.ping_retries: 10
discovery.zen.ping.multicast.enabled: false

After resetting one node, everything is OK again and the cluster is properly balanced:
{
"cluster_name" : "my_cluster",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 10,
"active_shards" : 20,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

Any ideas what could be wrong with my setup?

Regards

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

imdhmd · July 8, 2013, 3:06pm

Hey planckiii,

>> it's some kind of short network freeze between es nodes

Based on the statement above, I assume that the connection between the two
nodes is weak or unreliable.
In that case, you should see whether increasing the ping timeout and retries
values helps. Also, do you have any specific reason to disable multicast?

Also, if you have the head plugin installed on both nodes, then the next time
this happens, could you open the head pages of both nodes and check whether
both have become master and are therefore unable to form a single cluster
again? That would be a case of the split-brain problem.
To resolve it, you have two choices:

1. Exclude one of the nodes from becoming master, using the setting
   node.master: false
2. Use the zookeeper plugin to externalize master election
   (https://github.com/sonian/elasticsearch-zookeeper)

Imdad
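
The split-brain check described above (does each node believe in a different master?) can be sketched as a small comparison. The function name `detect_split_brain` and the node ids are hypothetical. How each node's view is collected is left out; one option might be the `master_node` field of each node's cluster state response, which is an assumption and not something the thread confirms.

```python
from typing import Dict, Optional


def detect_split_brain(master_seen_by: Dict[str, Optional[str]]) -> bool:
    """Return True if the nodes disagree about who the master is.

    `master_seen_by` maps a node name to the master id that node reports,
    or None if the node currently sees no master.
    """
    masters = {m for m in master_seen_by.values() if m is not None}
    return len(masters) > 1


# Hypothetical views: each node has elected itself after the network freeze.
views = {"first_node": "id-A", "second_node": "id-B"}
print(detect_split_brain(views))  # True

# Healthy case: both nodes agree on the same master.
print(detect_split_brain({"first_node": "id-A", "second_node": "id-A"}))  # False
```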

planckiii · July 9, 2013, 8:58pm

On Monday, July 8, 2013 at 17:06:56 UTC+2, Imdad Ahmed wrote:

> Hey planckiii,

Hi, thanks for the quick reply.

> >> it's some kind of short network freeze between es nodes
> Based on the statement above, I assume that the connection between the two
> nodes is weak or unreliable.
> In that case, you should see whether increasing the ping timeout and retries
> values helps. Also, do you have any specific reason to disable multicast?

I tried with this:
discovery.zen.ping_timeout: 5s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_interval: 2s
discovery.zen.fd.ping_retries: 10
Network freezes shouldn't last longer than a few seconds, so in theory that
should be OK. About multicast: these are VMs behind an internal NAT, so
multicast can't work outside that NAT.
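
As a rough back-of-the-envelope check of the fault-detection settings above: assuming that each of the `ping_retries` attempts waits the full `ping_timeout` before the other node is declared failed (an assumption about how these settings combine, not something confirmed in this thread), the tolerated outage can be estimated like this. The helper names are hypothetical.

```python
def seconds(value: str) -> float:
    """Convert a setting such as '30s' to seconds (only the 's' suffix is handled)."""
    if value.endswith("ms") or not value.endswith("s"):
        raise ValueError(f"unsupported time value: {value!r}")
    return float(value[:-1])


def tolerated_outage(ping_timeout: str, ping_retries: int) -> float:
    """Upper bound on an outage that ping-based fault detection would ride out.

    Assumption: every retry waits the full timeout before the node is dropped.
    """
    return seconds(ping_timeout) * ping_retries


# Values from the configuration in the post.
print(tolerated_outage("30s", 10))  # 300.0
```

Under that assumption, a freeze of a few seconds is far below the threshold, which matches the expectation that these settings should be OK in theory.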

> Also, if you have the head plugin installed on both nodes, then the next time
> this happens, could you open the head pages of both nodes and check whether
> both have become master and are therefore unable to form a single cluster
> again? That would be a case of the split-brain problem.
> To resolve it, you have two choices:
>
> 1. Exclude one of the nodes from becoming master, using the setting
>    node.master: false
> 2. Use the zookeeper plugin to externalize master election
>    (https://github.com/sonian/elasticsearch-zookeeper)

I agree that it is probably a split-brain problem after the disconnect, but
there isn't any info about that in the logs. I will check that on the next
failure. Thanks for the advice; I will update the status of this problem.

> Imdad

spinscale · July 10, 2013, 8:45am

Hey,

running a cluster in a cross-data-center setup is generally not a good
idea. For example, if you are using replicas, every indexing operation goes
to both data centers and returns only when both have finished. This will
introduce high latency into your system. The same is true for searches going
to several shards, which are shared across both data centers. If you can, try
to build a different sync mechanism instead of this kind of high-risk setup
(maybe writing data to both systems, each of which is an independent cluster
of its own?).

--Alex
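
The "write to both systems" idea above could look roughly like the following sketch, in which the application sends each document to two independent clusters instead of relying on cross-data-center replicas. Everything here is hypothetical: `dual_write`, the writer callables and the in-memory stores only stand in for whatever client code actually talks to each cluster.

```python
from typing import Callable, Dict

# A writer takes a document id and the document body and stores it in one cluster.
Writer = Callable[[str, dict], None]


def dual_write(doc_id: str, doc: dict, writers: Dict[str, Writer]) -> Dict[str, str]:
    """Send the same document to every cluster and record the outcome per cluster."""
    outcome = {}
    for name, write in writers.items():
        try:
            write(doc_id, doc)
            outcome[name] = "ok"
        except Exception as exc:
            # A failure in one data center must not block the write to the other.
            outcome[name] = f"failed: {exc}"
    return outcome


# In-memory stand-ins for the two clusters (assumption, for illustration only).
primary_store: dict = {}
secondary_store: dict = {}

result = dual_write(
    "1",
    {"title": "example"},
    {
        "primary_dc": primary_store.__setitem__,
        "secondary_dc": secondary_store.__setitem__,
    },
)
print(result)
```

A real setup would also need a way to replay the writes that failed for one cluster, which this sketch does not attempt.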
