Cluster stopped working but was working fine

I have an Elasticsearch cluster that was working fine, but it now seems to have stopped working properly.

The head plugin no longer works and gives the message: cluster health: not connected

This request: http://localhost:9200/_cat/health
gives this response:

{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

When I try to write I get: ReadTimeoutError(HTTPConnectionPool(...

The client timeout is set to 10 seconds.

Reads work fine
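
To take my client out of the picture I was also planning to try a write directly with curl, roughly like this (the test-write index and the document are just made up for the test, and the exact URL shape depends on the Elasticsearch version):

curl -XPUT 'http://localhost:9200/test-write/doc/1' -H 'Content-Type: application/json' -d '{"ping": 1}'
# with no master this is expected to come back with an error (a no-master / cluster-block style response) rather than succeed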

What could be causing the issue? No changes have been made to the servers.

Your help would be appreciated

Thanks

Grant

Hi,

Is your cluster up and running? Is there a master? Is it red, yellow or green?

bye,
Xavier

When I try this, I get the following

curl http://localhost:9200/_cluster/state
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

I have inherited the cluster, so I am trying to troubleshoot why it isn't working; at the very least I want to report to my infrastructure team what should work.

One thing I am presuming should work is the following.

In /etc/elasticsearch/elasticsearch.yml there are references to IP addresses:

discovery.zen.ping.unicast.hosts: ['XX.XXX.XXX.XXX', 'YY.YYY.YYY.YYY']

My expectation is that, from the client, these should work, as in get a response:

telnet XX.XXX.XXX.XXX 9200
curl -XGET "XX.XXX.XXX.XXX:9200"

Currently they don't, and I get an "unable to connect: Connection refused" message.
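
For completeness, a loop like the one below is what I plan to hand to the infrastructure team so they can re-test both the HTTP and transport ports on every discovery host (a rough sketch; it assumes nc is available, and the IPs are the placeholders from the config above).

for host in XX.XXX.XXX.XXX YY.YYY.YYY.YYY; do
  for port in 9200 9300; do
    # -z only tests whether the port accepts a connection; -w 3 gives up after 3 seconds
    nc -vz -w 3 "$host" "$port"
  done
done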

Thanks

Grant

Hi,

It seems that the node is not connected to the cluster and there is no master available. Can your local node contact the masters listed in discovery.zen.ping.unicast.hosts? If not, you should read this doc:

https://www.elastic.co/guide/en/elasticsearch/reference/current/discovery-settings.html

Maybe the node cannot join the cluster because of a firewall or something like that.
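
If you have shell access to the nodes, a quick look at the firewall rules and listening ports can also help (a rough check, assuming iptables is in use; firewalld or a cloud security group would need a different command).

sudo iptables -L -n | grep -E '9200|9300'
# and check what is actually listening on each node
sudo ss -tlnp | grep -E '9200|9300'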

bye,
Xavier

The node's logs will contain messages (including stack traces) describing in a bit more detail why it can't find the master.
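
For a package install they normally end up under /var/log/elasticsearch/ (the file name includes the cluster name), or in the journal if the node runs under systemd and the service is called elasticsearch:

sudo tail -n 200 /var/log/elasticsearch/*.log
sudo journalctl -u elasticsearch --since "1 hour ago"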

Thanks. I had a look and there were no logs. I restarted the service and saw some errors in syslog while it was restarting.

I corrected that and I now have some logs (a good start):

{#zen_unicast_6_A_ZhFv6mT3i65uDeJUdyjA#}{10.197.163.236}{XX.XXX.XXX.XXX:9300}{master=true}]
Oct 25 10:38:37 hcukazprocatap03 elasticsearch[28877]: [2018-10-25 10:38:37,007][WARN ][transport.netty ] [es-client-01] exception caught on transport layer [[id: 0x5a1fc5a8]], closing connection
Oct 25 10:38:37 hcukazprocatap03 elasticsearch[28877]: java.net.NoRouteToHostException: No route to host

I get a "No route to host" message, and I presume that relates to this IP: XX.XXX.XXX.XXX:9300

When I telnet to that (telnet XX.XXX.XXX.XXX 9300) I get a connection refused.

Could you confirm that I am approaching this the right way (using telnet)? My current thought is that the port is being blocked.

At the moment I am in the position of needing to tell our infrastructure team what the issue is.

Any guidance would be appreciated

Thanks

Grant

Yes, this sounds like a connectivity issue. telnet is a reasonable way to test basic connectivity to an Elasticsearch node's transport port (which defaults to 9300). If you manage to establish a connection, hitting <Enter> a few times should close the connection and produce the following sort of log message on Elasticsearch's side, which lets you see that you've actually connected to Elasticsearch and not to something else:

[2018-10-25T18:44:23,948][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [p6N7aBv] exception caught on transport layer [NettyTcpChannel{localAddress=/0:0:0:0:0:0:0:1:9300, remoteAddress=/0:0:0:0:0:0:0:1:53900}], closing connection
io.netty.handler.codec.DecoderException: java.io.StreamCorruptedException: invalid internal transport message format, got (d,a,d,a)

Thanks for your help with this.

Having looked into this further, it turned out the Elasticsearch service on the other boxes had stopped. I don't know why; I need to add some extra monitoring for that (something like the check sketched below).
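
Just a sketch of what I have in mind, run from cron on each box, assuming the nodes run under systemd (es-check is simply the tag I'd log under):

# alert if the service is stopped or the node stops answering on 9200
systemctl is-active --quiet elasticsearch || logger -t es-check "elasticsearch service is not running on $(hostname)"
curl -sf http://localhost:9200/ > /dev/null || logger -t es-check "elasticsearch is not responding on $(hostname):9200"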

All looks to be sorted now

Thanks again

Grant

