Hi,
After I restarted my ES cluster (3 nodes), curl -XGET 'http://localhost:9200/_cluster/health' times out on all nodes.
When I changed the cluster name in config/elasticsearch.yml and restarted ES, I could get the cluster health response back.
But if I change the cluster name back to the original and restart the nodes, cluster health times out again.
Please advise where I should look or what I should try.
Thank you for the quick reply, Patrick.
I use the default elasticsearch.yml; only the cluster name and log file location are changed.
All 3 nodes have the same config.
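For reference, the non-default part of each node's elasticsearch.yml is roughly the following (the log path shown here is illustrative, not my real one):

cluster.name: es-cluster1
path.logs: /var/log/elasticsearch

Everything else, including discovery, is left at the defaults.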
Oh, I need to be more specific about what 'times out' means:
curl http://localhost:9200/_cluster/health
{"error":"MasterNotDiscoveredException[waited for [30s]]","status":503}
I tried _status and got this:
curl http://localhost:9200/_status
{"error":"ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master];]","status":503}
I originally changed the cluster name to 'es-cluster1'.
Then I had the problem.
So for an experiment, I changed it to 'zzz' and restarted ES.
I got cluster health 'green' with this setup.
Then I changed it back to 'es-cluster1' and restarted ES; cluster health times out again.
Here is some additional information. Elasticsearch finds no nodes:
curl localhost:9200/_cluster/nodes
{"ok":true,"cluster_name":"es-cluster1","nodes":{}}
I can't find any references to this online, but it could simply be that cluster names with a '-' in them are not supported at this time. Have you tried ESCluster1 as a name? Or ES.Cluster1?
Thanks Patrick.
I have a sandbox environment, and I'm using the same cluster name there without any problem, so I don't think the '-' is the cause.
And if I change the cluster name to something else, whatever it is, it works even in the problem environment.
Eventually I may have to change the cluster name and reindex everything, but I want to figure out what the cause of the problem is.
Are you sure you are not somehow setting node.master: false in the config file while modifying cluster names?
I encountered this problem earlier, and this typical message appears when there are only data nodes and no master node.
Health, status, nodes, etc. cease to function in the absence of a master node.
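In other words, double-check that none of the three configs ended up with something like this (illustrative snippet); a node configured this way can never be elected master:

node.master: false
node.data: true

If all three nodes were set like that, there would be only data nodes and no master-eligible node, and you would see exactly the 'no master' blocks above.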
On Monday, June 4, 2012 6:17:29 PM UTC+5:30, Igor Motov wrote:
What do you see in the log files?
In the log files I see the following:
(node-1)
[2012-06-01 11:24:40,263][INFO ][discovery.zen ] [Blob] failed to send join request to master [[Living Colossus][mcxvTZ78T5uMbkSSh61lhw][inet[/10.5.124.115:9300]]], reason [org.elasticsearch.transport.RemoteTransportException: [Father Time][inet[/10.5.124.115:9300]][discovery/zen/join]; org.elasticsearch.ElasticSearchIllegalStateException: Node [[Father Time][zt8kbMEiTEKj2hIlRwEP7g][inet[/10.5.124.115:9300]]] not master for join request from [[Blob][vOr5-xBkRfedzFbHi8FaFw][inet[/10.5.124.107:9300]]]]
(node-2)
[2012-06-01 11:23:57,644][WARN ][discovery.zen ] [Doctor Dorcas] failed to connect to master [[Living Colossus][mcxvTZ78T5uMbkSSh61lhw][inet[/10.5.124.115:9300]]], retrying...
org.elasticsearch.transport.ConnectTransportException: [Living Colossus][inet[/10.5.124.115:9300]] connect_timeout[30s]
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:560)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:503)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:482)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:128)
at org.elasticsearch.discovery.zen.ZenDiscovery.innterJoinCluster(ZenDiscovery.java:312)
at org.elasticsearch.discovery.zen.ZenDiscovery.access$500(ZenDiscovery.java:69)
at org.elasticsearch.discovery.zen.ZenDiscovery$1.run(ZenDiscovery.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:592)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.connect(NioClientSocketPipelineSink.java:399)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.processSelectedKeys(NioClientSocketPipelineSink.java:361)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink$Boss.run(NioClientSocketPipelineSink.java:277)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
... 3 more
-- also this --
[2012-06-01 11:24:41,112][INFO ][discovery.zen ] [Battering Ram] failed to send join request to master [[Living Colossus][mcxvTZ78T5uMbkSSh61lhw][inet[/10.5.124.115:9300]]], reason [org.elasticsearch.transport.RemoteTransportException: [Father Time][inet[/10.5.124.115:9300]][discovery/zen/join]; org.elasticsearch.ElasticSearchIllegalStateException: Node [[Father Time][zt8kbMEiTEKj2hIlRwEP7g][inet[/10.5.124.115:9300]]] not master for join request from [[Battering Ram][MygDoIOdQDmBZbgNY130lQ][inet[/10.5.124.110:9300]]]]
(node-3)
[2012-06-01 11:40:00,219][WARN ][discovery.zen.ping.multicast] [Father Time] failed to receive confirmation on sent ping response to [[Blob][vOr5-xBkRfedzFbHi8FaFw][inet[/10.5.124.107:9300]]]
org.elasticsearch.transport.NodeDisconnectedException: [Blob][inet[/10.5.124.107:9300]][discovery/zen/multicast] disconnected
[2012-06-01 11:40:00,220][WARN ][discovery.zen.ping.multicast] [Father Time] failed to receive confirmation on sent ping response to [[Blob][vOr5-xBkRfedzFbHi8FaFw][inet[/10.5.124.107:9300]]]
org.elasticsearch.transport.SendRequestTransportException: [Blob][inet[/10.5.124.107:9300]][discovery/zen/multicast]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:200)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:172)
at org.elasticsearch.discovery.zen.ping.multicast.MulticastZenPing$Receiver$1.run(MulticastZenPing.java:531)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:679)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [Blob][inet[/10.5.124.107:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:637)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:445)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:185)
... 5 more
I think I have more or less figured out what was going on.
I increased the log level and found the following log entry:
[2012-06-04 16:20:29,523][TRACE][discovery.zen.ping.multicast] [Turner D. Century] [1] received ping_response{target [[Seamus Mellencamp][7hKKDw5ARY22JDKA6brSSA][inet[/10.5.124.114:9300]]{client=true, data=false}], master [[Living Colossus][mcxvTZ78T5uMbkSSh61lhw][inet[/10.5.124.115:9300]]], cluster_name[es-cluster1]}
The node responding to the multicast ping is not an elasticsearch node but a machine that uses the elasticsearch Java API, where an elasticsearch client is running.
The client responded to the multicast discovery ping and somehow answered with a master id that no longer exists.
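For context, that service creates its client with the Java API roughly like this (a sketch, not our actual code; class name and usage are illustrative). A node client built this way joins the cluster as client=true, data=false and still takes part in zen discovery, which is why it answers multicast pings at all:

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;
import static org.elasticsearch.node.NodeBuilder.nodeBuilder;

public class EsNodeClientSketch {
    public static void main(String[] args) {
        // Illustrative node client: client=true means no shards are allocated here,
        // but the embedded node still participates in zen discovery (including
        // multicast pings) just like a regular cluster node.
        Node node = nodeBuilder()
                .clusterName("es-cluster1")
                .client(true)
                .node();               // starts the embedded node and joins the cluster
        Client client = node.client(); // the handle the service uses for its requests
        // ... the service indexes and searches through 'client' from here on ...
    }
}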
I stopped that process, and now all elasticsearch nodes respond to the cluster health request.
My guess is that the cause of the problem was that I restarted all elasticsearch nodes but did not restart the service that uses the elasticsearch Java API client.
Do we have to restart the client every time we restart the elasticsearch cluster? Or is there some condition under which we have to?
Indeed, the issue might have occurred because one of the Java API clients didn't detect that the master was gone and kept broadcasting the old master id to the other nodes. We experienced similar issues in the past, and until we got rid of all Java API clients our standard operating procedure was to stop all Java API clients before a full cluster restart.
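For a node client like the one sketched above, that just means closing the embedded node before the restart, so it stops answering discovery pings, and re-creating it afterwards. A minimal sketch (class and method names here are made up for illustration):

import org.elasticsearch.client.Client;
import org.elasticsearch.node.Node;

public final class EsClientRestartHelper {
    // Call this before the full cluster restart; re-create the node client afterwards.
    public static void stopEsClient(Client client, Node node) {
        client.close(); // release the client handle
        node.close();   // the embedded node leaves the cluster and stops answering
                        // multicast discovery pings with the stale master id
    }
}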
Eventually, the client node will detect that the master node no longer exists and will stop broadcasting it. I wonder, though, whether with multicast it makes sense not to use the client nodes to help with master election, as they might have different communication settings to the cluster.
This bit us: restarting elasticsearch VMs on EC2 doesn't work until we take down our web applications. The cluster kept looking for the old master because the client nodes remembered it, and that IP address didn't exist anymore. This is going to impact production: if elasticsearch goes down, we'll require an outage to get it restarted.
It would be very helpful if the client nodes did not contribute to master election, or could in some way be overruled when that master is gone.
On Tue, Jun 5, 2012 at 3:27 PM, Patrick pat...@eefy.net wrote:
Have you guys logged a bug around this, perhaps?