Cluster connection issues when the machines hosting the nodes are restarted for service maintenance

Mxims_2 · June 15, 2013, 1:11am

We currently have Elasticsearch version 0.20.5 running in production (on
Windows 2008 64-bit), and it has been indexing and searching thousands of
documents for the past 2+ months. So far everything has worked fine, but we
run into issues when the machines are restarted during weekends for service
maintenance. After the restart, the nodes do not connect to each other in the
cluster, so indexing and search requests fail until I manually restart one of
the services.

If my settings are correct, we currently have two nodes on the cluster,
MS084 and MS095. These two are supposed to be acting as master-replica
between the two. If one is down, the other node is supposed to take care of
the indexing and search requests.

For your reference, I have attached the config files and the log files for
the clusters.

From Java code, I am connecting to the cluster using the following piece of
code -

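import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// hostname and port identify one node's transport address (port 9300 in the logs)
TransportClient tClient = new TransportClient();
tClient = tClient.addTransportAddress(new InetSocketTransportAddress(hostname, port));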

Here is what I see in the log files.

@ 18:08 MS084 node was stopped.
@ 18:13 MS084 was back online and the service started. At this time the
node discovered the other node, MS095, and was added to the cluster.
@ 18:57 MS095 node was stopped.
@ 18:59 MS095 was back online, initialized, and started. At this time the
node did not discover the other node, MS084, so the cluster failed.
@ 19:00 onward, you can see that search requests started to error out:
"not available for scroll request".

I am guessing this is the behavior that is causing the cluster to fail.

After I restarted the node MS084, the cluster was formed and the search and
indexing requests started working alright!

I am guessing there is something I messed up in the settings of the cluster
on the config files. Please let me know what I am missing.

Thanks a lot for your help!

Renjith

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Mxims_2 · June 17, 2013, 11:39pm

Guys, any thoughts on this?


Mxims_2 · June 25, 2013, 6:38pm

The same issue happened during this month's maintenance weekend! Guys, any
help on this?

[2013-06-22 18:06:44,705][INFO ][node ] [MS095] {0.20.5}[1832]: stopping ...
[2013-06-22 18:06:45,141][INFO ][node ] [MS095] {0.20.5}[1832]: stopped
[2013-06-22 18:06:45,141][INFO ][node ] [MS095] {0.20.5}[1832]: closing ...
[2013-06-22 18:06:45,157][INFO ][node ] [MS095] {0.20.5}[1832]: closed
[2013-06-22 18:09:23,770][INFO ][node ] [MS095] {0.20.5}[1948]: initializing ...
[2013-06-22 18:09:23,833][INFO ][plugins ] [MS095] loaded , sites [head]
[2013-06-22 18:09:32,647][INFO ][node ] [MS095] {0.20.5}[1948]: initialized
[2013-06-22 18:09:32,647][INFO ][node ] [MS095] {0.20.5}[1948]: starting ...
[2013-06-22 18:09:32,850][INFO ][transport ] [MS095] bound_address {inet[/149.59.13.95:9300]}, publish_address {inet[/149.59.13.95:9300]}
[2013-06-22 18:09:53,972][INFO ][cluster.service ] [MS095] new_master [MS095][qI6pqS46Q82e-a-k4fDNaw][inet[/149.59.13.95:9300]], reason: zen-disco-join (elected_as_master)
[2013-06-22 18:09:53,988][INFO ][discovery ] [MS095] elasticsearch/qI6pqS46Q82e-a-k4fDNaw
[2013-06-22 18:09:54,034][INFO ][http ] [MS095] bound_address {inet[/149.59.13.95:7105]}, publish_address {inet[/149.59.13.95:7105]}
[2013-06-22 18:09:54,034][INFO ][node ] [MS095] {0.20.5}[1948]: started
[2013-06-22 18:09:56,733][INFO ][gateway ] [MS095] recovered [1] indices into cluster_state
[2013-06-22 20:02:57,956][DEBUG][action.search.type ] [MS095] Node [TQg2TmdsRWGugg1OSxyhJA] not available for scroll request [scan;5;5:TQg2TmdsRWGugg1OSxyhJA;1:TQg2TmdsRWGugg1OSxyhJA;2:TQg2TmdsRWGugg1OSxyhJA;3:TQg2TmdsRWGugg1OSxyhJA;4:TQg2TmdsRWGugg1OSxyhJA;1;total_hits:1558;]

====================================================================================

[2013-06-22 18:06:44,705][INFO ][discovery.zen ] [MS084] master_left [[MS095][o_Ija_q2Tx-RHS1xIoYxaQ][inet[/149.59.13.95:9300]]], reason [shut_down]
[2013-06-22 18:06:44,727][INFO ][cluster.service ] [MS084] master {new [MS084][SIpZ4d0kR4CUEVAMzrcQCg][inet[/149.59.13.184:9300]], previous [MS095][o_Ija_q2Tx-RHS1xIoYxaQ][inet[/149.59.13.95:9300]]}, removed {[MS095][o_Ija_q2Tx-RHS1xIoYxaQ][inet[/149.59.13.95:9300]],}, reason: zen-disco-master_failed ([MS095][o_Ija_q2Tx-RHS1xIoYxaQ][inet[/149.59.13.95:9300]])
[2013-06-22 18:08:38,229][INFO ][node ] [MS084] {0.20.5}[19968]: stopping ...
[2013-06-22 18:08:38,399][INFO ][node ] [MS084] {0.20.5}[19968]: stopped
[2013-06-22 18:08:38,399][INFO ][node ] [MS084] {0.20.5}[19968]: closing ...
[2013-06-22 18:08:38,421][INFO ][node ] [MS084] {0.20.5}[19968]: closed
[2013-06-22 18:13:09,024][INFO ][node ] [MS084] {0.20.5}[1832]: initializing ...
[2013-06-22 18:13:09,079][INFO ][plugins ] [MS084] loaded , sites [head]
[2013-06-22 18:13:13,895][INFO ][node ] [MS084] {0.20.5}[1832]: initialized
[2013-06-22 18:13:13,896][INFO ][node ] [MS084] {0.20.5}[1832]: starting ...
[2013-06-22 18:13:14,048][INFO ][transport ] [MS084] bound_address {inet[/149.59.13.184:9300]}, publish_address {inet[/149.59.13.184:9300]}
[2013-06-22 18:13:35,113][INFO ][cluster.service ] [MS084] new_master [MS084][TQg2TmdsRWGugg1OSxyhJA][inet[/149.59.13.184:9300]], reason: zen-disco-join (elected_as_master)
[2013-06-22 18:13:35,140][INFO ][discovery ] [MS084] elasticsearch/TQg2TmdsRWGugg1OSxyhJA
[2013-06-22 18:13:35,184][INFO ][http ] [MS084] bound_address {inet[/149.59.13.184:7105]}, publish_address {inet[/149.59.13.184:7105]}
[2013-06-22 18:13:35,184][INFO ][node ] [MS084] {0.20.5}[1832]: started
[2013-06-22 18:13:36,735][INFO ][gateway ] [MS084] recovered [1] indices into cluster_state
[2013-06-22 18:13:59,584][DEBUG][action.search.type ] [MS084] Node [qI6pqS46Q82e-a-k4fDNaw] not available for scroll request [scan;5;26:qI6pqS46Q82e-a-k4fDNaw;28:qI6pqS46Q82e-a-k4fDNaw;27:qI6pqS46Q82e-a-k4fDNaw;30:qI6pqS46Q82e-a-k4fDNaw;29:qI6pqS46Q82e-a-k4fDNaw;1;total_hits:1558;]
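
In the two logs above, both MS095 (at 18:09:53) and MS084 (at 18:13:35) report "elected_as_master" after restarting, which suggests that each node formed its own single-node cluster instead of joining the other. A rough way to spot this in larger log files is sketched below. It is only an illustration: the class name SelfElectedMasters is made up, and it assumes each log entry sits on a single line in the format shown above.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Elasticsearch: lists the nodes that logged
// a "new_master ... (elected_as_master)" entry for themselves.
final class SelfElectedMasters {
    // Captures the node name that follows "new_master [" on an election line.
    private static final Pattern ELECTED =
            Pattern.compile("new_master \\[([^\\]]+)\\].*\\(elected_as_master\\)");

    private SelfElectedMasters() {}

    static Set<String> find(Iterable<String> logLines) {
        Set<String> nodes = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher m = ELECTED.matcher(line);
            if (m.find()) {
                nodes.add(m.group(1));
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Two entries copied from the logs in this thread, joined onto one line each.
        List<String> lines = Arrays.asList(
                "[2013-06-22 18:09:53,972][INFO ][cluster.service ] [MS095] new_master "
                        + "[MS095][qI6pqS46Q82e-a-k4fDNaw][inet[/149.59.13.95:9300]], "
                        + "reason: zen-disco-join (elected_as_master)",
                "[2013-06-22 18:13:35,113][INFO ][cluster.service ] [MS084] new_master "
                        + "[MS084][TQg2TmdsRWGugg1OSxyhJA][inet[/149.59.13.184:9300]], "
                        + "reason: zen-disco-join (elected_as_master)");
        Set<String> masters = find(lines);
        System.out.println("Self-elected masters: " + masters);
        if (masters.size() > 1) {
            System.out.println("More than one node elected itself; the nodes may be in separate clusters.");
        }
    }
}
```

A node electing itself is normal when it is the first one up, so more than one name in the result is only a hint; the timestamps still need to be compared by hand.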


Christian_Th · June 25, 2013, 7:50pm

Does the following configuration help?

# Set to ensure a node sees N other master eligible nodes to be considered
# operational within the cluster. Set this option to a higher value (2-4)
# for large clusters (>3 nodes):
discovery.zen.minimum_master_nodes: 2



Christian_Th · June 25, 2013, 7:54pm

I reproduced the problem without this setting:

discovery.zen.minimum_master_nodes: 2

After adding it, it worked for me.
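
For reference, the value usually recommended for this setting is a majority of the master-eligible nodes, floor(N / 2) + 1, which gives 2 for the two nodes in this thread. A minimal Java sketch of that arithmetic follows; the class and method names are hypothetical and exist only for illustration.

```java
// Hypothetical helper: computes a majority quorum for
// discovery.zen.minimum_master_nodes from the number of master-eligible nodes.
final class MasterQuorum {
    private MasterQuorum() {}

    /** Returns floor(n / 2) + 1, the smallest majority of n nodes. */
    static int minimumMasterNodes(int masterEligibleNodes) {
        if (masterEligibleNodes < 1) {
            throw new IllegalArgumentException("need at least one master-eligible node");
        }
        return masterEligibleNodes / 2 + 1;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " master-eligible node(s) -> minimum_master_nodes: "
                    + minimumMasterNodes(n));
        }
    }
}
```

Note that with two nodes and a value of 2, a single surviving node cannot elect itself master while the other one is down. A third master-eligible node would be needed to keep a master available through a single-node restart.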


Mxims_2 · June 26, 2013, 5:55am

Hi Christian,

Thank you so much for your help.

I have configured it on both the nodes in production, and restarted the
nodes. Hopefully it will not fail anymore.

Thank you so much,
Renjith
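
One way to check after the next maintenance restart that both nodes really joined the same cluster is to ask either node for its cluster health and look at number_of_nodes. The sketch below is only a rough illustration under some assumptions: the class name ClusterSizeCheck is made up, the HTTP port (7105 in the logs above) is passed in by the caller, and the response is read with a simple regular expression rather than a JSON parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads number_of_nodes from a /_cluster/health response.
final class ClusterSizeCheck {
    private static final Pattern NODES =
            Pattern.compile("\"number_of_nodes\"\\s*:\\s*(\\d+)");

    private ClusterSizeCheck() {}

    /** Extracts the number_of_nodes value from the health response body. */
    static int numberOfNodes(String healthJson) {
        Matcher m = NODES.matcher(healthJson);
        if (!m.find()) {
            throw new IllegalArgumentException("number_of_nodes not found in response");
        }
        return Integer.parseInt(m.group(1));
    }

    /** Fetches the raw health response from one node over HTTP. */
    static String fetchHealth(String host, int httpPort) throws IOException {
        URL url = new URL("http://" + host + ":" + httpPort + "/_cluster/health");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: java ClusterSizeCheck <host> <httpPort> <expectedNodes>");
            return;
        }
        int seen = numberOfNodes(fetchHealth(args[0], Integer.parseInt(args[1])));
        int expected = Integer.parseInt(args[2]);
        System.out.println(seen == expected
                ? "OK: cluster has " + seen + " node(s)"
                : "WARNING: expected " + expected + " node(s) but saw " + seen);
    }
}
```

Running it against each node in turn with an expected count of 2 would show whether either node is sitting in a cluster of its own.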

