Cluster connection issues when the machines hosting the nodes are restarted for service maintenance

We have had Elasticsearch version 0.20.5 running in production (on
Windows 2008 64-bit), indexing and searching thousands of documents, for the
past 2+ months. So far everything has worked fine, but we run into issues
when the machines are restarted on weekends for service maintenance. After
a restart, the nodes do not reconnect to each other in the cluster, so
indexing and search requests fail until I manually restart one of the
services.

If my settings are correct, we currently have two nodes in the cluster,
MS084 and MS095. The two are supposed to act as a master-replica pair: if
one is down, the other is supposed to handle the indexing and search
requests.

For your reference, I have attached the config files and the log files for
the clusters.

From Java code, I am connecting to the cluster using the following piece of
code -

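import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

TransportClient tClient = new TransportClient();
tClient = tClient.addTransportAddress(
        new InetSocketTransportAddress(hostname, port));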

Here is what I see in the log file.

  • @ 18:08 MS084 node was stopped

  • @ 18:13 MS084 was back online and the service was started. At this
    time the node discovered the other node, MS095, and joined the cluster

  • @ 18:57 MS095 node was stopped

  • @ 18:59 MS095 node was back online, initialized, and started. At
    this time it did not discover the other node, MS084, so the cluster
    failed

  • @ 19:00 onward you can see that the search requests started to error
    out: "not available for scroll request"

I am guessing this is the behavior that is causing the cluster to fail.

After I restarted the node MS084, the cluster formed and search and
indexing requests started working again!

I am guessing there is something I messed up in the cluster settings in
the config files. Please let me know what I am missing.

Thanks a lot for your help!

Renjith

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Guys, any thoughts on this?

(In reply to the original post by Mxims, Friday, June 14, 2013 6:11:45 PM UTC-7.)

The same issue happened during this month's maintenance weekend! Guys, any
help on this?

[2013-06-22 18:06:44,705][INFO ][node ] [MS095] {0.20.5}[1832]: stopping ...
[2013-06-22 18:06:45,141][INFO ][node ] [MS095] {0.20.5}[1832]: stopped
[2013-06-22 18:06:45,141][INFO ][node ] [MS095] {0.20.5}[1832]: closing ...
[2013-06-22 18:06:45,157][INFO ][node ] [MS095] {0.20.5}[1832]: closed
[2013-06-22 18:09:23,770][INFO ][node ] [MS095] {0.20.5}[1948]: initializing ...
[2013-06-22 18:09:23,833][INFO ][plugins ] [MS095] loaded , sites [head]
[2013-06-22 18:09:32,647][INFO ][node ] [MS095] {0.20.5}[1948]: initialized
[2013-06-22 18:09:32,647][INFO ][node ] [MS095] {0.20.5}[1948]: starting ...
[2013-06-22 18:09:32,850][INFO ][transport ] [MS095] bound_address {inet[/149.59.13.95:9300]}, publish_address {inet[/149.59.13.95:9300]}
[2013-06-22 18:09:53,972][INFO ][cluster.service ] [MS095] new_master [MS095][qI6pqS46Q82e-a-k4fDNaw][inet[/149.59.13.95:9300]], reason: zen-disco-join (elected_as_master)
[2013-06-22 18:09:53,988][INFO ][discovery ] [MS095] elasticsearch/qI6pqS46Q82e-a-k4fDNaw
[2013-06-22 18:09:54,034][INFO ][http ] [MS095] bound_address {inet[/149.59.13.95:7105]}, publish_address {inet[/149.59.13.95:7105]}
[2013-06-22 18:09:54,034][INFO ][node ] [MS095] {0.20.5}[1948]: started
[2013-06-22 18:09:56,733][INFO ][gateway ] [MS095] recovered [1] indices into cluster_state
[2013-06-22 20:02:57,956][DEBUG][action.search.type ] [MS095] Node [TQg2TmdsRWGugg1OSxyhJA] not available for scroll request [scan;5;5:TQg2TmdsRWGugg1OSxyhJA;1:TQg2TmdsRWGugg1OSxyhJA;2:TQg2TmdsRWGugg1OSxyhJA;3:TQg2TmdsRWGugg1OSxyhJA;4:TQg2TmdsRWGugg1OSxyhJA;1;total_hits:1558;]

====================================================================================

[2013-06-22 18:06:44,705][INFO ][discovery.zen ] [MS084] master_left [[MS095][o_Ija_q2Tx-RHS1xIoYxaQ][inet[/149.59.13.95:9300]]], reason [shut_down]
[2013-06-22 18:06:44,727][INFO ][cluster.service ] [MS084] master {new [MS084][SIpZ4d0kR4CUEVAMzrcQCg][inet[/149.59.13.184:9300]], previous [MS095][o_Ija_q2Tx-RHS1xIoYxaQ][inet[/149.59.13.95:9300]]}, removed {[MS095][o_Ija_q2Tx-RHS1xIoYxaQ][inet[/149.59.13.95:9300]],}, reason: zen-disco-master_failed ([MS095][o_Ija_q2Tx-RHS1xIoYxaQ][inet[/149.59.13.95:9300]])
[2013-06-22 18:08:38,229][INFO ][node ] [MS084] {0.20.5}[19968]: stopping ...
[2013-06-22 18:08:38,399][INFO ][node ] [MS084] {0.20.5}[19968]: stopped
[2013-06-22 18:08:38,399][INFO ][node ] [MS084] {0.20.5}[19968]: closing ...
[2013-06-22 18:08:38,421][INFO ][node ] [MS084] {0.20.5}[19968]: closed
[2013-06-22 18:13:09,024][INFO ][node ] [MS084] {0.20.5}[1832]: initializing ...
[2013-06-22 18:13:09,079][INFO ][plugins ] [MS084] loaded , sites [head]
[2013-06-22 18:13:13,895][INFO ][node ] [MS084] {0.20.5}[1832]: initialized
[2013-06-22 18:13:13,896][INFO ][node ] [MS084] {0.20.5}[1832]: starting ...
[2013-06-22 18:13:14,048][INFO ][transport ] [MS084] bound_address {inet[/149.59.13.184:9300]}, publish_address {inet[/149.59.13.184:9300]}
[2013-06-22 18:13:35,113][INFO ][cluster.service ] [MS084] new_master [MS084][TQg2TmdsRWGugg1OSxyhJA][inet[/149.59.13.184:9300]], reason: zen-disco-join (elected_as_master)
[2013-06-22 18:13:35,140][INFO ][discovery ] [MS084] elasticsearch/TQg2TmdsRWGugg1OSxyhJA
[2013-06-22 18:13:35,184][INFO ][http ] [MS084] bound_address {inet[/149.59.13.184:7105]}, publish_address {inet[/149.59.13.184:7105]}
[2013-06-22 18:13:35,184][INFO ][node ] [MS084] {0.20.5}[1832]: started
[2013-06-22 18:13:36,735][INFO ][gateway ] [MS084] recovered [1] indices into cluster_state
[2013-06-22 18:13:59,584][DEBUG][action.search.type ] [MS084] Node [qI6pqS46Q82e-a-k4fDNaw] not available for scroll request [scan;5;26:qI6pqS46Q82e-a-k4fDNaw;28:qI6pqS46Q82e-a-k4fDNaw;27:qI6pqS46Q82e-a-k4fDNaw;30:qI6pqS46Q82e-a-k4fDNaw;29:qI6pqS46Q82e-a-k4fDNaw;1;total_hits:1558;]
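
One possible reading of the two logs: after its restart, each node logs new_master with itself as the master (elected_as_master), which would suggest that the two nodes formed separate one-node clusters instead of rejoining each other. Below is a minimal Java sketch of that check. It assumes each log entry sits on a single line, and the names SelfElectedMasters and selfElected are made up for illustration; they are not part of Elasticsearch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not an Elasticsearch API): lists the nodes whose
// cluster.service log entry says they elected themselves as master.
final class SelfElectedMasters {

    // Captures the logging node's name and requires the same name right
    // after "new_master", followed by "elected_as_master" on the same line.
    private static final Pattern SELF_ELECTED = Pattern.compile(
        "\\[cluster\\.service\\s*\\]\\s*\\[([^\\]]+)\\]\\s*new_master\\s*\\[\\1\\].*elected_as_master");

    static List<String> selfElected(List<String> logLines) {
        List<String> nodes = new ArrayList<String>();
        for (String line : logLines) {
            Matcher m = SELF_ELECTED.matcher(line);
            if (m.find() && !nodes.contains(m.group(1))) {
                nodes.add(m.group(1));
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        // The two new_master entries copied from the logs in this thread.
        List<String> lines = Arrays.asList(
            "[2013-06-22 18:09:53,972][INFO ][cluster.service ] [MS095] new_master [MS095][qI6pqS46Q82e-a-k4fDNaw][inet[/149.59.13.95:9300]], reason: zen-disco-join (elected_as_master)",
            "[2013-06-22 18:13:35,113][INFO ][cluster.service ] [MS084] new_master [MS084][TQg2TmdsRWGugg1OSxyhJA][inet[/149.59.13.184:9300]], reason: zen-disco-join (elected_as_master)");
        List<String> masters = selfElected(lines);
        System.out.println("Self-elected masters: " + masters);
        if (masters.size() > 1) {
            System.out.println("More than one node elected itself: possible split cluster");
        }
    }
}
```

If more than one node name comes back for the same time window, the nodes most likely did not see each other during discovery, which would match the "not available for scroll request" messages each node logs about the other's node ID.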

Does the following configuration help?

Set to ensure a node sees N other master eligible nodes to be considered
operational within the cluster. Set this option to a higher value (2-4)
for large clusters (>3 nodes):

discovery.zen.minimum_master_nodes: 2
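
For context, the value commonly recommended for this setting is a majority of the master-eligible nodes, (N / 2) + 1 with integer division, which gives 2 for a two-node cluster. The sketch below only illustrates that arithmetic; MasterQuorum and its methods are hypothetical names, not Elasticsearch code, and it assumes the count of visible nodes includes the node itself.

```java
// Hypothetical illustration of the quorum arithmetic behind
// discovery.zen.minimum_master_nodes; not part of Elasticsearch.
final class MasterQuorum {

    // Majority of master-eligible nodes: floor(n / 2) + 1.
    static int minimumMasterNodes(int masterEligibleNodes) {
        if (masterEligibleNodes < 1) {
            throw new IllegalArgumentException("need at least one master-eligible node");
        }
        return masterEligibleNodes / 2 + 1;
    }

    // A node may take part in electing a master only if it can see at least
    // minimumMasterNodes master-eligible nodes (assumed to include itself).
    static boolean canElectMaster(int visibleMasterEligibleNodes, int minimumMasterNodes) {
        return visibleMasterEligibleNodes >= minimumMasterNodes;
    }

    public static void main(String[] args) {
        int quorum = minimumMasterNodes(2);
        System.out.println("quorum for 2 nodes = " + quorum);
        // With the default of 1, a node that sees only itself can elect itself.
        System.out.println("alone, minimum 1: " + canElectMaster(1, 1));
        // With 2, a node that sees only itself has to wait for the other node.
        System.out.println("alone, minimum 2: " + canElectMaster(1, quorum));
    }
}
```

One trade-off to keep in mind with only two nodes: with the value set to 2, a node that cannot see the other one will not elect a master on its own, so the cluster has no master while either node is down. A third master-eligible node would allow one node to be restarted while the remaining two still form a majority.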

I reproduced the problem without this setting:

discovery.zen.minimum_master_nodes: 2

After adding it, it worked for me.

Hi Christian,

Thank you so much for your help.

I have configured it on both nodes in production and restarted them.
Hopefully it will not fail anymore.

Thank you so much,
Renjith
