Failed, restarting discovery - master not available

Hi, I hope someone here can help.
I've recently set up a three-node cluster on AKS using the Elasticsearch Helm chart, with pretty much all of the default values (apart from the obvious persistent volumes etc.). Everything works great for a few hours, but then I get an error:

{"type": "server", "timestamp": "2019-10-11T15:50:20,748Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "d2c-es-cluster-nonprod", "node.name": "d2c-es-cluster-nonprod-master-1", "message": "master node [{d2c-es-cluster-nonprod-master-2}{IeYJlsrSQ3KFxNBdn8M7sA}{qlURXoZ7SW6uKlBCoB1j5A}{10.253.156.134}{10.253.156.134:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery", "cluster.uuid": "XPDaEguqRnWjluTh-Y6UVA", "node.id": "kyTywCrARMO1TGRDaWkoPA"

The three-node setup is configured with three master, ingest and data nodes (as per the default). I've checked each node and, as far as I can tell, each node is able to communicate with the others over ports 9300 and 9200.

As other posts suggest, I've tried increasing the timeouts and I've turned off authentication and SSL, but after a few hours I get the error mentioned above and service discovery fails. The only way to fix it is to restart the entire cluster.
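For context, these are the kinds of fault-detection and publication timeouts I experimented with raising; treat this as a rough sketch rather than my exact config, and the values here are arbitrary:

# elasticsearch.yml - hypothetical example of raised timeouts (7.x setting names)
cluster.publish.timeout: 60s                          # default is 30s
cluster.fault_detection.leader_check.timeout: 30s     # default is 10s
cluster.fault_detection.follower_check.timeout: 30s   # default is 10s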

I've attached my values YAML file here: https://pastebin.com/S59buPPw
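Roughly, the non-default bits look like the sketch below. The keys follow the elastic/helm-charts elasticsearch chart; the storage class and sizes are placeholders rather than my exact values, so please refer to the Pastebin link for the real file:

# values.yaml (sketch only)
clusterName: "d2c-es-cluster-nonprod"
replicas: 3
minimumMasterNodes: 2
volumeClaimTemplate:
  accessModes: ["ReadWriteOnce"]
  storageClassName: "managed-premium"   # assumed AKS storage class
  resources:
    requests:
      storage: 30Gi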

I've been pulling my hair out on this one for days!

Still not getting anywhere with this. I've turned on trace logging, and the first error that is encountered is:
{"type": "server", "timestamp": "2019-10-15T17:52:27,279Z", "level": "TRACE", "component": "o.e.t.T.tracer", "cluster.name": "d2c-es-cluster-nonprod", "node.name": "d2c-es-cluster-nonprod-master-1", "message": "[30666][internal:cluster/request_pre_vote] sent error response", "cluster.uuid": "o44iCCQCSvWvbMlS9DJLIg", "node.id": "_AfuhQHlTCO4p9jIVojAVw" ,

No idea what this means.
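For reference, trace logging of this tracer (o.e.t.T.tracer) can be enabled along these lines; this is a sketch using the standard logger name rather than necessarily exactly what I did:

# elasticsearch.yml (sketch) - enable the transport request tracer
logger.org.elasticsearch.transport.TransportService.tracer: TRACE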

All these tiny log fragments tell us is that something went wrong. You will need to share much more of your logs to help us help you work out what that "something" is.

Thanks, David,

I appreciate the response.
I've attached a link to the logs from one of my nodes (I believe it is the elected master node):
https://pastebin.com/j2fvT3SV - node 2

This is a vanilla three-node cluster; it doesn't even have any indices. After a few hours I tried to submit a basic template script, but it fails with a master not available exception.

Other node logs:
https://pastebin.com/xmzhGTQN - node 0
https://pastebin.com/tWbdpCXF - node 1

Thanks, that's useful. Looking at the logs from d2c-es-cluster-nonprod-development-master-2 (which was indeed the elected master from 2019-10-16T10:27:26Z) I see that it was silent until 2019-10-16T12:39:31Z and then reported issues until 2019-10-16T12:40:02Z when it looks like it was re-elected as master again. Was that the end of it or did it carry on reporting problems? What was the next log message?

There are no more log messages. I tried to post a template script but I just got nothing out of it; completely dead.

The only way to recover it is to restart the whole set of nodes. Occasionally when this happens the cluster is still partially responding, but for any POST requests I repeatedly get a service unavailable message; example below.

{"type": "server", "timestamp": "2019-10-12T23:56:56,874Z", "level": "WARN", "component": "o.e.x.m.e.l.LocalExporter", "cluster.name": "d2c-es-cluster-nonprod", "node.name": "d2c-es-cluster-nonprod-master-2", "message": "unexpected error while indexing monitoring document", "cluster.uuid": "XPDaEguqRnWjluTh-Y6UVA", "node.id": "IeYJlsrSQ3KFxNBdn8M7sA" ,
"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: RemoteTransportException[[d2c-es-cluster-nonprod-master-1][10.253.156.174:9300][indices:data/write/bulk[s]]]; nested: ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];];",
"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:125) ~[x-pack-monitoring-7.4.0.jar:7.4.0]",
"at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]",
"at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177) ~[?:?]",
"at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:?]",
"at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]",
"at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]",
"at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]",
"at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]",

There are a few things here that I'm struggling to fit into my understanding of how Elasticsearch works.

If a node doesn't have a master then it will (by default) emit a log message every 10 seconds saying so. If we're not seeing these log messages then that tells us that these nodes do have a master most of the time. But if these nodes do have a master most of the time and then occasionally report blocked by: [SERVICE_UNAVAILABLE/2/no master], then I think they would be logging master node changed messages too. And yet you say that isn't happening?

The message you've quoted, from o.e.x.m.e.l.LocalExporter, is about the failure of an internal action triggered on a timer and should not have anything to do with a request that you are sending. Perhaps when you send a request it is triggering a temporary master failure, resulting in that message, and then the master recovers. But again this would result in a master node changed message that you are saying isn't happening?

I think we're still missing far too much detail here. Can you be a bit more precise? Specifically, can you share all the logs as well as the API calls you are making, and the responses, and indicate the timeline of those API calls against the logs. The message you've quoted happened at 2019-10-12T23:56Z, many days ago, and I don't see how it relates to the other logs you shared in Pastebin above.

Thanks, David,

Sorry, I've had so many different versions of the cluster up and down over the past few days that I got a bit lost in all the log files!

I've been setting up another clean cluster this afternoon and received the error. All I was doing was using Kibana to create an index lifecycle policy, and I got the below info at 2019-10-16T20:09:19,536Z:
{"type": "server", "timestamp": "2019-10-16T20:09:19,536Z", "level": "INFO", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "d2c-es-cluster-prod", "node.name": "d2c-es-cluster-prod-master-1", "message": "after [10s] publication of cluster state version [60] is still waiting for {d2c-es-cluster-prod-master-0}{toxb8i8QTQqxV85UGFwAOA}{3eN9f7vVQnq4uirgonH3cQ}{10.253.142.32}{10.253.142.32:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {d2c-es-cluster-prod-master-2}{hGtaxylJToeRvQ6fKQmfkg}{8PIeKHJzRW-Qrxr3Aohcww}{10.253.142.77}{10.253.142.77:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {d2c-es-cluster-prod-master-1}{ZNJT576sQ5WcGAE8xa78Yw}{HBF9jRqiQxOLEmwLn7DlBw}{10.253.142.161}{10.253.142.161:9300}{dilm}{ml.machine_memory=21474836480, xpack.installed=true, ml.max_open_jobs=20} [WAITING_FOR_QUORUM]", "cluster.uuid": "5xgGnZjvRgGAvMAeqwI-WQ", "node.id": "ZNJT576sQ5WcGAE8xa78Yw" }

Then, about 30 seconds later, I get the following error:
{"type": "server", "timestamp": "2019-10-16T20:09:46,415Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "d2c-es-cluster-prod", "node.name": "d2c-es-cluster-prod-master-1", "message": "path: /_cluster/settings, params: {include_defaults=true}", "cluster.uuid": "5xgGnZjvRgGAvMAeqwI-WQ", "node.id": "ZNJT576sQ5WcGAE8xa78Yw" , "stacktrace": ["org.elasticsearch.discovery.MasterNotDiscoveredException: null", "at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:214) [elasticsearch-7.4.0.jar:7.4.0]", "at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:325) [elasticsearch-7.4.0.jar:7.4.0]", "at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:252) [elasticsearch-7.4.0.jar:7.4.0]", "at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:598) [elasticsearch-7.4.0.jar:7.4.0]", "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [elasticsearch-7.4.0.jar:7.4.0]", "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]", "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]", "at java.lang.Thread.run(Thread.java:830) [?:?]"] }

I've attached the full log here:
https://pastebin.com/Cz6k9j3f

Ok, interesting. I still think we aren't seeing enough logs. The last master-node change is here, in which d2c-es-cluster-prod-master-1 is elected:

{"type": "server", "timestamp": "2019-10-16T20:09:09,534Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "d2c-es-cluster-prod", "node.name": "d2c-es-cluster-prod-master-1", "message": "elected-as-master ([2] nodes joined)[{d2c-es-cluster-prod-master-0}{toxb8i8QTQqxV85UGFwAOA}{3eN9f7vVQnq4uirgonH3cQ}{10.253.142.32}{10.253.142.32:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} elect leader, {d2c-es-cluster-prod-master-1}{ZNJT576sQ5WcGAE8xa78Yw}{HBF9jRqiQxOLEmwLn7DlBw}{10.253.142.161}{10.253.142.161:9300}{dilm}{ml.machine_memory=21474836480, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 2, version: 60, reason: master node changed {previous [], current [{d2c-es-cluster-prod-master-1}{ZNJT576sQ5WcGAE8xa78Yw}{HBF9jRqiQxOLEmwLn7DlBw}{10.253.142.161}{10.253.142.161:9300}{dilm}{ml.machine_memory=21474836480, xpack.installed=true, ml.max_open_jobs=20}]}", "cluster.uuid": "5xgGnZjvRgGAvMAeqwI-WQ", "node.id": "ZNJT576sQ5WcGAE8xa78Yw" }

However I don't see messages indicating that this election has completely finished, or has timed out, and then the logs finish just a couple of minutes later. Can you collect logs for longer? Ideally I'd like to see the election succeed and the cluster quieten down again, or else see three or four elected-as-master messages from retries of the elections.

There is some suggestion that cluster state publication is happening very slowly:

{"type": "server", "timestamp": "2019-10-16T20:06:19,442Z", "level": "INFO", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "d2c-es-cluster-prod", "node.name": "d2c-es-cluster-prod-master-1", "message": "after [10.1s] publication of cluster state version [59] is still waiting for {d2c-es-cluster-prod-master-0}{toxb8i8QTQqxV85UGFwAOA}{3eN9f7vVQnq4uirgonH3cQ}{10.253.142.32}{10.253.142.32:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {d2c-es-cluster-prod-master-2}{hGtaxylJToeRvQ6fKQmfkg}{8PIeKHJzRW-Qrxr3Aohcww}{10.253.142.77}{10.253.142.77:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {d2c-es-cluster-prod-master-1}{ZNJT576sQ5WcGAE8xa78Yw}{HBF9jRqiQxOLEmwLn7DlBw}{10.253.142.161}{10.253.142.161:9300}{dilm}{ml.machine_memory=21474836480, xpack.installed=true, ml.max_open_jobs=20} [WAITING_FOR_QUORUM]", "cluster.uuid": "5xgGnZjvRgGAvMAeqwI-WQ", "node.id": "ZNJT576sQ5WcGAE8xa78Yw" }

However this isn't obviously a network issue, because the next publication has the same issue and this is just after an election:

{"type": "server", "timestamp": "2019-10-16T20:09:19,536Z", "level": "INFO", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "d2c-es-cluster-prod", "node.name": "d2c-es-cluster-prod-master-1", "message": "after [10s] publication of cluster state version [60] is still waiting for {d2c-es-cluster-prod-master-0}{toxb8i8QTQqxV85UGFwAOA}{3eN9f7vVQnq4uirgonH3cQ}{10.253.142.32}{10.253.142.32:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {d2c-es-cluster-prod-master-2}{hGtaxylJToeRvQ6fKQmfkg}{8PIeKHJzRW-Qrxr3Aohcww}{10.253.142.77}{10.253.142.77:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {d2c-es-cluster-prod-master-1}{ZNJT576sQ5WcGAE8xa78Yw}{HBF9jRqiQxOLEmwLn7DlBw}{10.253.142.161}{10.253.142.161:9300}{dilm}{ml.machine_memory=21474836480, xpack.installed=true, ml.max_open_jobs=20} [WAITING_FOR_QUORUM]", "cluster.uuid": "5xgGnZjvRgGAvMAeqwI-WQ", "node.id": "ZNJT576sQ5WcGAE8xa78Yw" }

Not sure if this helps, but I've turned on monitoring within Kibana and I can see the regular outages. Are there any processes in Elasticsearch that run at regular intervals?

It's kind of interesting but it doesn't really help, no.

Thanks for your help, David.

In case anyone stumbles across this with the same issue, I've managed to work out what the problem is.
Basically, our connections between Kubernetes nodes go via an enterprise firewall, and the firewall was terminating the TCP connections, timing them out after a period of inactivity.

To fix this I set the following in the Elasticsearch config:
transport.ping_schedule: 5s

As far as I understand it, this keeps the TCP connections alive by sending ping requests every 5 seconds.
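Applied through the Helm chart, it looks something like the snippet below; the esConfig key is from the elastic/helm-charts elasticsearch chart, and this is a sketch of the approach rather than my exact file:

# values.yaml (sketch) - pushes the setting into elasticsearch.yml on each node
esConfig:
  elasticsearch.yml: |
    transport.ping_schedule: 5s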

Well spotted. This daft behaviour of firewalls, along with the workaround, is mentioned in the docs:

Elasticsearch opens a number of long-lived TCP connections between each pair of nodes in the cluster, and some of these connections may be idle for an extended period of time. Nonetheless, Elasticsearch requires these connections to remain open, and it can disrupt the operation of the cluster if any inter-node connections are closed by an external influence such as a firewall. It is important to configure your network to preserve long-lived idle connections between Elasticsearch nodes, for instance by leaving tcp.keep_alive enabled and ensuring that the keepalive interval is shorter than any timeout that might cause idle connections to be closed, or by setting transport.ping_schedule if keepalives cannot be configured.

TCP keepalives are preferable if you can configure your firewall to respect them, but transport.ping_schedule is a viable workaround if not.
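In elasticsearch.yml the two approaches look roughly like this (7.x setting names; the 5s value is just an example):

# Option 1: rely on TCP keepalives (on by default) and make sure the OS-level
# keepalive interval (e.g. net.ipv4.tcp_keepalive_time on Linux) is shorter
# than the firewall's idle timeout.
network.tcp.keep_alive: true

# Option 2: application-level pings on otherwise-idle transport connections.
transport.ping_schedule: 5s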
