Cluster stopped working but was working fine

grant_donovan · October 23, 2018, 3:03pm

I have an Elasticsearch cluster. It was all working, but it now seems to have stopped working properly.

The head plugin no longer works and shows the message: cluster health: not connected

This request: http://localhost:9200/_cat/health
returns this response:

{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

When I try to write I get: ReadTimeoutError(HTTPConnectionPool(...

Timeout is set to 10

Reads work fine

What could be causing the issue? No changes have been made to the servers.

Your help would be appreciated

Thanks

Grant

xavierfacq · October 23, 2018, 5:39pm

Hi,

Is your cluster up and running? Is there a master? Is it Red, Yellow or Green?

bye,
Xavier

grant_donovan · October 24, 2018, 10:03am

When I try this, I get the following

curl http://localhost:9200/_cluster/state
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}
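
A 503 with master_not_discovered_exception means the node answering on port 9200 is up but has not found an elected master. For anyone scripting this check, here is a minimal sketch of recognising that case from the response body. The function name is_master_not_discovered is made up for illustration and is not part of any Elasticsearch client.

```python
import json


def is_master_not_discovered(body: str) -> bool:
    """Return True if an Elasticsearch error body reports a missing master.

    Hypothetical helper: it only inspects the top-level "error.type" field.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all (e.g. a proxy error page).
        return False
    if not isinstance(payload, dict):
        return False
    error = payload.get("error")
    if not isinstance(error, dict):
        return False
    return error.get("type") == "master_not_discovered_exception"


# The response body quoted in this thread.
sample = (
    '{"error":{"root_cause":[{"type":"master_not_discovered_exception",'
    '"reason":null}],"type":"master_not_discovered_exception",'
    '"reason":null},"status":503}'
)
print(is_master_not_discovered(sample))  # True
```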

I have inherited the cluster, so I am trying to troubleshoot why it isn't working. At the very least, I want to report to my infrastructure team what should work.

One thing I presume should work:

In /etc/elasticsearch/elasticsearch.yml there are references to IP addresses:

discovery.zen.ping.unicast.hosts: ['XX.XXX.XXX.XXX', 'YY.YYY.YYY.YYY']

My expectation is that these should work from the client, as in return a response:

telnet XX.XXX.XXX.XXX 9200
curl -XGET "XX.XXX.XXX.XXX:9200"

Currently they don't; I get an "unable to connect: Connection refused" message.

Thanks

Grant
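
One note on the checks above: as far as I know, entries in discovery.zen.ping.unicast.hosts refer to the transport port (9300 by default when no port is given), not the HTTP port 9200, so a test against the transport port is the more telling one for discovery. Below is a rough sketch of turning a host list into the host/port pairs worth testing. The helper name transport_addresses is hypothetical, the addresses are placeholders from the documentation range, and IPv6 literals are not handled.

```python
DEFAULT_TRANSPORT_PORT = 9300  # Elasticsearch's default transport port


def transport_addresses(hosts):
    """Expand 'host' or 'host:port' entries into (host, port) tuples.

    Hypothetical helper; entries without a port get the default
    transport port. IPv6 addresses are not supported by this sketch.
    """
    addresses = []
    for entry in hosts:
        host, sep, port = entry.rpartition(":")
        if sep and port.isdigit():
            addresses.append((host, int(port)))
        else:
            addresses.append((entry, DEFAULT_TRANSPORT_PORT))
    return addresses


# Placeholder addresses, not the ones from the real config.
print(transport_addresses(["192.0.2.10", "192.0.2.11:9301"]))
# [('192.0.2.10', 9300), ('192.0.2.11', 9301)]
```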

xavierfacq · October 24, 2018, 11:33am

Hi,

It seems that the node is not connected to the cluster. There is no master available. Is your local node unable to contact the masters listed in discovery.zen.ping.unicast.hosts? If so, you should read this doc:

https://www.elastic.co/guide/en/elasticsearch/reference/current/discovery-settings.html

Maybe the cluster is not joinable because of a firewall or something like that.

bye,
Xavier

DavidTurner · October 24, 2018, 12:14pm

The node's logs will contain messages (including stack traces) describing in a bit more detail why it can't find the master.

grant_donovan · October 25, 2018, 2:42pm

Thanks. I had a look and there were no logs. I restarted the service and saw some errors in syslog while it was restarting.

I corrected those and I now have some logs (good start):

{#zen_unicast_6_A_ZhFv6mT3i65uDeJUdyjA#}{10.197.163.236}{XX.XXX.XXX.XXX:9300}{master=true}]
Oct 25 10:38:37 hcukazprocatap03 elasticsearch[28877]: [2018-10-25 10:38:37,007][WARN ][transport.netty ] [es-client-01] exception caught on transport layer [[id: 0x5a1fc5a8]], closing connection
Oct 25 10:38:37 hcukazprocatap03 elasticsearch[28877]: java.net.NoRouteToHostException: No route to host

I get a "No route to host" message, and I presume that relates to this IP: XX.XXX.XXX.XXX:9300

When I telnet to that (telnet XX.XXX.XXX.XXX 9300) I get a connection refused.

Could you confirm that I am approaching this the right way (telnet)? My current thinking is that the port is being blocked.

At the moment I am in the position where I need to tell our infrastructure team what the issue is.

Any guidance would be appreciated.

Thanks

Grant

DavidTurner · October 25, 2018, 5:48pm

Yes, this sounds like a connectivity issue. telnet is a reasonable way to test basic connectivity to an Elasticsearch node's transport port (which defaults to 9300). If you manage to establish a connection, hitting <Enter> a few times should close the connection and produce log messages like the following on Elasticsearch's side, which let you see that you've actually connected to Elasticsearch and not to something else.

[2018-10-25T18:44:23,948][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [p6N7aBv] exception caught on transport layer [NettyTcpChannel{localAddress=/0:0:0:0:0:0:0:1:9300, remoteAddress=/0:0:0:0:0:0:0:1:53900}], closing connection
io.netty.handler.codec.DecoderException: java.io.StreamCorruptedException: invalid internal transport message format, got (d,a,d,a)
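
The same telnet-style check can be scripted. Below is a minimal sketch, assuming Python 3, with a made-up helper name probe, that opens a TCP connection and reports which kind of failure occurred. Roughly speaking, "refused" usually means the host was reached but nothing is listening on that port (or a firewall actively rejected the connection), while "no route" points at routing or firewall rules between the two machines.

```python
import errno
import socket


def probe(host, port, timeout=5.0):
    """Try a plain TCP connection and classify the outcome.

    Hypothetical helper: returns "open", "refused", "timeout",
    "no route", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host reachable, but no listener on the port (or an active reject).
        return "refused"
    except socket.timeout:
        # No answer at all; packets are often being dropped silently.
        return "timeout"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "no route"
        return "error: %s" % exc
```

Calling probe with a node's address and port 9300 from the client machine gives a one-word answer that can be passed on to an infrastructure team. It only shows that something accepts connections on the port, not that it is Elasticsearch.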

grant_donovan · October 26, 2018, 10:50am

Thanks for your help with this.

Having looked into this further, it looks like the Elasticsearch service on the other boxes had stopped. I don't know why; I need to add some extra monitoring for that.

All looks to be sorted now.

Thanks again

Grant
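
On the monitoring point, one simple option is to poll the _cluster/health endpoint on a schedule and alert on anything unexpected. This is only a sketch: the function names cluster_status and needs_alert are made up, and treating "yellow" as acceptable is an assumption that may not suit every cluster.

```python
import json
import urllib.error
import urllib.request


def cluster_status(base_url, timeout=10.0):
    """Return the cluster status string, or a short failure description.

    Hypothetical helper: queries <base_url>/_cluster/health.
    """
    url = base_url.rstrip("/") + "/_cluster/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp).get("status", "unknown")
    except urllib.error.HTTPError as exc:
        # For example 503 when no master has been discovered.
        return "http %d" % exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure and similar.
        return "unreachable"
    except ValueError:
        # The response body was not valid JSON.
        return "unknown"


def needs_alert(status):
    """Assumption: only green and yellow count as healthy."""
    return status not in ("green", "yellow")
```

Running something like needs_alert(cluster_status("http://localhost:9200")) from cron on each node would have flagged both the stopped services and the missing master seen in this thread.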

system · November 23, 2018, 10:50am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
