Master node toggles constantly

Hi there,

we have an issue where the master node is constantly toggling - it leaves and rejoins over and over. We deleted all nodes including the PVCs and started all over, but we hit the same issue on AKS again.
It's a 3-node setup.

Log of one toggling elasticsearch pod:

{"type": "server", "timestamp": "2020-07-23T06:06:37,487Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node changed {previous [], current [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}]}, term: 639, version: 43472, reason: ApplyCommitRequest{term=639, version=43472, sourceNode={elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug"  }
    {"type": "server", "timestamp": "2020-07-23T06:06:40,447Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug" ,
    "stacktrace": ["org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-master-2][10.86.242.116:9300][disconnected] disconnected"] }
    {"type": "server", "timestamp": "2020-07-23T06:06:40,448Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node changed {previous [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}], current []}, term: 639, version: 43472, reason: becoming candidate: onLeaderFailure", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug"  }
    {"type": "server", "timestamp": "2020-07-23T06:06:40,506Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node changed {previous [], current [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}]}, term: 639, version: 43474, reason: ApplyCommitRequest{term=639, version=43474, sourceNode={elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug"  }
    {"type": "server", "timestamp": "2020-07-23T06:06:43,451Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug" ,
    "stacktrace": ["org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-master-2][10.86.242.116:9300][disconnected] disconnected"] }
    {"type": "server", "timestamp": "2020-07-23T06:06:43,452Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node changed {previous [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}], current []}, term: 639, version: 43474, reason: becoming candidate: onLeaderFailure", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug"  }

Log of the other pod:

{"type": "server", "timestamp": "2020-07-23T06:13:00,084Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "node-join[{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 639, version: 43699, delta: added {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:00,131Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "added {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}, term: 639, version: 43699, reason: Publication{term=639, version=43699}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:03,077Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "node-left[{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true} reason: disconnected], term: 639, version: 43701, delta: removed {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:03,104Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "removed {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}, term: 639, version: 43701, reason: Publication{term=639, version=43701}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:05,102Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "node-join[{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 639, version: 43702, delta: added {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:05,142Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "added {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}, term: 639, version: 43702, reason: Publication{term=639, version=43702}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:08,094Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "node-left[{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true} reason: disconnected], term: 639, version: 43703, delta: removed {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }

We can't see the reason for this constant toggling. Any idea why this might happen? Thanks a lot!

Something (outside of Elasticsearch) is breaking the network connection between the nodes.

Hi @DavidTurner,
thank you - that's what we thought as well.
We don't manage the AKS network ourselves.
Is there a trace level you would suggest setting to get more details on the disconnects?
We played around with different trace levels but only got stack traces.
Maybe I could call a pingable REST endpoint in a loop and ask the AKS team to trace that.
Any suggestions?
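
In case it helps others, this is roughly how we raised the logging (just a sketch via the cluster settings API; the exact loggers and levels we tried may have differed):

    # Sketch: raise transport/discovery logging dynamically via the cluster settings API,
    # run from inside one of the pods
    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "logger.org.elasticsearch.transport": "TRACE",
        "logger.org.elasticsearch.discovery": "DEBUG"
      }
    }'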

Thanks a lot

There's not really any more information available to Elasticsearch beyond the fact that the transport connection between the nodes was unexpectedly dropped. Given that it's happening every few seconds, it should be easy for your network folks to observe this directly. Pinging REST endpoints probably won't help; this problem doesn't involve the REST layer.
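
For example, the drops should show up in a plain packet capture on the transport port (a sketch, assuming tcpdump can be run somewhere that sees the pod-to-pod traffic):

    # Sketch: watch for FIN/RST packets on the Elasticsearch transport port 9300
    tcpdump -i any -nn 'tcp port 9300 and (tcp[tcpflags] & (tcp-rst|tcp-fin) != 0)'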

Okay, thanks @DavidTurner.

At the moment we're analyzing this with the AKS/K8s people.

One thing I observed is that the health endpoint responds very slowly about once a minute. See the response times:

0.008[elasticsearch@elasticsearch-master-0 ~]$ curl -XGET elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200  -w '%{time_total}' --output /dev/null --silent
0.007[elasticsearch@elasticsearch-master-0 ~]$ curl -XGET elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200  -w '%{time_total}' --output /dev/null --silent
0.008[elasticsearch@elasticsearch-master-0 ~]$ curl -XGET elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200  -w '%{time_total}' --output /dev/null --silent
5.517[elasticsearch@elasticsearch-master-0 ~]$ curl -XGET elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200  -w '%{time_total}' --output /dev/null --silent
0.006[elasticsearch@elasticsearch-master-0 ~]$ curl -XGET elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200  -w '%{time_total}' --output /dev/null --silent
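
To make this easier to hand over to the AKS team, a small loop like this could log the timings continuously (just a sketch):

    # Sketch: sample the response time of the other master's HTTP endpoint once per second
    while true; do
      t=$(curl -so /dev/null -w '%{time_total}' \
        elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200)
      echo "$(date -u +%FT%TZ) ${t}s"
      sleep 1
    done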

That seems unsurprising, but I don't expect it'll help much with your investigations.

It seems that linkerd caused the problem; after removing the sidecar everything works fine now.
If anyone has this problem, check your StatefulSet for:

    linkerd.io/inject: enabled 

:wink: ....
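
A quick way to check is something like this (a sketch, assuming the chart's default names and the monitoring namespace used above):

    # Sketch: look for the injection annotation on the StatefulSet
    kubectl -n monitoring get statefulset elasticsearch-master -o yaml | grep 'linkerd.io/inject'
    # and check whether a linkerd-proxy sidecar ended up next to the Elasticsearch container
    kubectl -n monitoring get pod elasticsearch-master-0 -o jsonpath='{.spec.containers[*].name}'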

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.