Cluster health times out


(arta) #1

Hi,
After I restarted my ES cluster (3 nodes), every curl -XGET 'http://localhost:9200/_cluster/health' times out on all nodes.
If I change the cluster name in config/elasticsearch.yml and restart ES, I get a cluster health response back.
But if I change the cluster name back to the original and restart, cluster health times out again.
Please advise where I should look or what I should try.

Thank you for your help.


(Patrick) #2

Could you perhaps send along a copy of your configuration file? Is it the
same on all 3 nodes?

Patrick


patrick eefy net



(arta) #3

Thank you for the quick reply, Patrick.
I use the default elasticsearch.yml, with only the cluster name and log file location changed.
All 3 nodes have the same config.

Oh, I need to be more specific about what 'times out' means:
curl http://localhost:9200/_cluster/health
{"error":"MasterNotDiscoveredException[waited for [30s]]","status":503}

I tried _status and got this:
curl http://localhost:9200/_status
{"error":"ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];]","status":503}
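
For reference, a minimal sketch of the configuration described above; the exact log path here is an assumption, not the poster's actual value:

```yaml
# config/elasticsearch.yml -- identical on all 3 nodes
cluster.name: es-cluster1
path.logs: /var/log/elasticsearch   # example location; the real path may differ
```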


(Patrick) #4

Hey,

So just to confirm, were you setting the cluster name originally, or did
you not set the cluster name at all (when it wouldn't work)?

Patrick




(arta) #5

I originally changed the cluster name to 'es-cluster1'.
Then I had the problem.
So as an experiment, I changed it to 'zzz' and restarted ES.
I got cluster health 'green' with this setup.
Then I changed it back to 'es-cluster1' and restarted ES.
Cluster health times out again.


(arta) #6

Here's a clarification and some additional information.
What 'times out' means:
curl http://localhost:9200/_cluster/health
{"error":"MasterNotDiscoveredException[waited for [30s]]","status":503}

I tried _status and got this:
curl http://localhost:9200/_status
{"error":"ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];]","status":503}

ES finds no nodes:
curl localhost:9200/_cluster/nodes
{"ok":true,"cluster_name":"es-cluster1","nodes":{}}


(Patrick) #7

I can't find any references to this online, but it could simply be that cluster names containing a '-' are not supported at this time. Have you tried ESCluster1 or ES.Cluster1 as a name?

Patrick




(arta) #8

Thanks Patrick.
I have a sandbox environment, and I'm using the same cluster name there without any problem.
So I don't think '-' is the cause of the problem.
And if I change the cluster name to something else, whatever it is, it works even in the problem environment.
Eventually I may have to change the cluster name and reindex everything, but I want to figure out what the cause of the problem is.


(Igor Motov) #9

What do you see in the log files?



(sujoysett) #10

Are you sure you are not somehow setting node.master: false in the
config file while modifying cluster names?
I encountered this problem earlier; this message typically appears when
there are only data nodes and no master node.
Health, status, nodes, etc. cease to function in the absence of a master node.
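
To rule this out, these are the two role settings worth double-checking on each node (the values shown are the defaults):

```yaml
node.master: true   # eligible to be elected master
node.data: true    # holds index data
```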



(arta) #11

Thanks for responding, Igor, sujoysett.

In the log files I see the following:
(node-1)
[2012-06-01 11:24:40,263][INFO ][discovery.zen ] [Blob] failed to send join request to master [[Living Colossus][mcxvTZ78T5uMbkSSh61lhw][inet[/10.5.124.115:9300]]], reason [org.elasticsearch.transport.RemoteTransportException: [Father Time][inet[/10.5.124.115:9300]][discovery/zen/join]; org.elasticsearch.ElasticSearchIllegalStateException: Node [[Father Time][zt8kbMEiTEKj2hIlRwEP7g][inet[/10.5.124.115:9300]]] not master for join request from [[Blob][vOr5-xBkRfedzFbHi8FaFw][inet[/10.5.124.107:9300]]]]

(node-2)
[2012-06-01 11:23:57,644][WARN ][discovery.zen ] [Doctor Dorcas] failed to connect to master [[Living Colossus][mcxvTZ78T5uMbkSSh61lhw][inet[/10.5.124.115:9300]]], retrying...
org.elasticsearch.transport.ConnectTransportException: [Living Colossus][inet[/10.5.124.115:9300]] connect_timeout[30s]
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:560)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:503)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:482)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:128)
at org.elasticsearch.discovery.zen.ZenDiscovery.innterJoinCluster(ZenDiscovery.java:312)
at org.elasticsearch.discovery.zen.ZenDiscovery.access$500(ZenDiscovery.java:69)
at org.elasticsearch.discovery.zen.ZenDiscovery$1.run(ZenDiscovery.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:399)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:361)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:277)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
... 3 more
-- also this --
[2012-06-01 11:24:41,112][INFO ][discovery.zen ] [Battering Ram] failed to send join request to master [[Living Colossus][mcxvTZ78T5uMbkSSh61lhw][inet[/10.5.124.115:9300]]], reason [org.elasticsearch.transport.RemoteTransportException: [Father Time][inet[/10.5.124.115:9300]][discovery/zen/join]; org.elasticsearch.ElasticSearchIllegalStateException: Node [[Father Time][zt8kbMEiTEKj2hIlRwEP7g][inet[/10.5.124.115:9300]]] not master for join request from [[Battering Ram][MygDoIOdQDmBZbgNY130lQ][inet[/10.5.124.110:9300]]]]

(node-3)
[2012-06-01 11:40:00,219][WARN ][discovery.zen.ping.multicast] [Father Time] failed to receive confirmation on sent ping response to [[Blob][vOr5-xBkRfedzFbHi8FaFw][inet[/10.5.124.107:9300]]]
org.elasticsearch.transport.NodeDisconnectedException: [Blob][inet[/10.5.124.107:9300]][discovery/zen/multicast] disconnected
[2012-06-01 11:40:00,220][WARN ][discovery.zen.ping.multicast] [Father Time] failed to receive confirmation on sent ping response to [[Blob][vOr5-xBkRfedzFbHi8FaFw][inet[/10.5.124.107:9300]]]
org.elasticsearch.transport.SendRequestTransportException: [Blob][inet[/10.5.124.107:9300]][discovery/zen/multicast]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:200)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:172)
at org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing$Receiver$1.run(MulticastZenPing.java:531)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [Blob][inet[/10.5.124.107:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:637)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:445)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:185)
... 5 more

I did not set node.master: false.

Thanks for your help.
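
The "Connection refused" in the trace above suggests the transport port was unreachable at that moment. A quick probe to rule out basic connectivity between nodes, sketched with bash's /dev/tcp (the IPs are the ones from this thread; adjust for your cluster):

```shell
# Check whether each node's transport port (9300) accepts TCP connections.
# "unreachable" here would match the ConnectException in the log above.
for host in 10.5.124.107 10.5.124.110 10.5.124.115; do
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/9300" 2>/dev/null; then
    echo "$host:9300 reachable"
  else
    echo "$host:9300 unreachable"
  fi
done
```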


(arta) #12

Additional Info:
node-1 is 10.5.124.107
node-2 is 10.5.124.110
node-3 is 10.5.124.115


(arta) #13

I think I've sort of figured out what was going on.
I increased the log level and found the following entry:
[2012-06-04 16:20:29,523][TRACE][discovery.zen.ping.multicast] [Turner D. Century] [1] received ping_response{target [[Seamus Mellencamp][7hKKDw5ARY22JDKA6brSSA][inet[/10.5.124.114:9300]]{client=true, data=false}], master [[Living Colossus][mcxvTZ78T5uMbkSSh61lhw][inet[/10.5.124.115:9300]]], cluster_name[es-cluster1]}

The node responding to the multicast ping is not an elasticsearch node, but a machine that uses the elasticsearch Java API, where an elasticsearch client is running.
The client responded to the multicast discovery ping and answered with a master id that no longer exists.
I stopped that process, and now all elasticsearch nodes respond to the cluster health request.

My guess is that the cause of the problem was that I restarted all elasticsearch nodes but did not restart the service that uses the elasticsearch Java API client.
Do we have to restart the client every time we restart the elasticsearch cluster? Or is there some condition that requires us to do so?

Thanks for your help.
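
One way to keep stray multicast responders (like that leftover client) out of discovery entirely is to disable multicast and list the nodes explicitly. A sketch for the elasticsearch.yml of that era, using the IPs from this thread:

```yaml
# Disable multicast discovery and ping only the known data nodes.
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.5.124.107", "10.5.124.110", "10.5.124.115"]
```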


(Igor Motov) #14

Indeed, the issue might have occurred because one of the Java API clients
didn't detect that the master was gone and was broadcasting the old master
id to the other nodes. We experienced similar issues in the past, and until
we got rid of all Java API clients our standard operating procedure was to
stop all Java API clients before a full cluster restart.
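
The procedure described above can be sketched as a small helper, run after the clients are stopped and the nodes restarted; the endpoint and attempt budget are assumptions:

```shell
# Poll cluster health until it reports green, up to a fixed number of attempts.
wait_for_green() {
  local tries=$1
  local i
  for i in $(seq 1 "$tries"); do
    if curl -s 'http://localhost:9200/_cluster/health' 2>/dev/null | grep -q '"status":"green"'; then
      echo "cluster is green"
      return 0
    fi
    sleep 1
  done
  echo "cluster did not recover"
  return 1
}
# usage: stop the Java API clients, restart all ES nodes, then: wait_for_green 60
```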



(Patrick) #15

Have you guys logged a bug around this, perhaps?

Patrick




(Shay Banon) #16

Eventually, the client node will detect that the master node no longer
exists and will stop broadcasting it. I wonder, though, whether with
multicast it makes sense to not use the client nodes to help with master
election, as they might have different communication settings to the
cluster.



(Jessica Kerr) #17

This bit us: restarting elasticsearch VMs on EC2 doesn't work until we take
down our web applications. The cluster kept looking for the old master
because the client nodes remembered it, and that IP address no longer
existed. This is going to impact production: if elasticsearch goes down,
we'll require an outage to get it restarted.

It would be very helpful if the client nodes did not contribute to master
election, or could in some way be overruled when that master is gone.



(Shay Banon) #18

They no longer do. Which version are you using?


