Master node toggles constantly

Hi there,

we have an issue where the master node is constantly toggling - it leaves and rejoins over and over. We deleted all nodes including the PVCs and started all over, but we hit the same issue on AKS again.
It's a 3-node setup.

Log of one toggling elasticsearch pod:

{"type": "server", "timestamp": "2020-07-23T06:06:37,487Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node changed {previous [], current [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}]}, term: 639, version: 43472, reason: ApplyCommitRequest{term=639, version=43472, sourceNode={elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug"  }
    {"type": "server", "timestamp": "2020-07-23T06:06:40,447Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug" ,
    "stacktrace": ["org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-master-2][10.86.242.116:9300][disconnected] disconnected"] }
    {"type": "server", "timestamp": "2020-07-23T06:06:40,448Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node changed {previous [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}], current []}, term: 639, version: 43472, reason: becoming candidate: onLeaderFailure", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug"  }
    {"type": "server", "timestamp": "2020-07-23T06:06:40,506Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node changed {previous [], current [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}]}, term: 639, version: 43474, reason: ApplyCommitRequest{term=639, version=43474, sourceNode={elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug"  }
    {"type": "server", "timestamp": "2020-07-23T06:06:43,451Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug" ,
    "stacktrace": ["org.elasticsearch.transport.NodeDisconnectedException: [elasticsearch-master-2][10.86.242.116:9300][disconnected] disconnected"] }
    {"type": "server", "timestamp": "2020-07-23T06:06:43,452Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-0", "message": "master node changed {previous [{elasticsearch-master-2}{svQtzZRwQgmnWze-0qaYmA}{Bjoo0SIrR5mAOmFzsbDl3Q}{10.86.242.116}{10.86.242.116:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}], current []}, term: 639, version: 43474, reason: becoming candidate: onLeaderFailure", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "4kp9Ml6QRpCmJAGE3Evfug"  }

Log of the other pod:

{"type": "server", "timestamp": "2020-07-23T06:13:00,084Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "node-join[{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 639, version: 43699, delta: added {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:00,131Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "added {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}, term: 639, version: 43699, reason: Publication{term=639, version=43699}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:03,077Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "node-left[{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true} reason: disconnected], term: 639, version: 43701, delta: removed {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:03,104Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "removed {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}, term: 639, version: 43701, reason: Publication{term=639, version=43701}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:05,102Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "node-join[{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true} join existing leader], term: 639, version: 43702, delta: added {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:05,142Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "added {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}, term: 639, version: 43702, reason: Publication{term=639, version=43702}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }
    {"type": "server", "timestamp": "2020-07-23T06:13:08,094Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master-2", "message": "node-left[{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true} reason: disconnected], term: 639, version: 43703, delta: removed {{elasticsearch-master-0}{4kp9Ml6QRpCmJAGE3Evfug}{tCdslyhKQ5Wmko3yOHJ5DQ}{10.86.242.63}{10.86.242.63:9300}{dilm}{ml.machine_memory=10737418240, ml.max_open_jobs=20, xpack.installed=true}}", "cluster.uuid": "3cuI01DzRaq114jQAwRQgA", "node.id": "svQtzZRwQgmnWze-0qaYmA"  }

We can't see the reason for this constant toggling. Any idea why this might happen? Thanks a lot!

Something (outside of Elasticsearch) is breaking the network connection between the nodes.

Hi @DavidTurner,
thank you - that's what we thought as well.
We don't manage the AKS network ourselves.
Is there a trace level you would suggest setting to get more details on the disconnects?
We played around with different trace levels but only got stack traces.
Maybe I could call a pingable REST endpoint in a loop and ask the AKS team to trace that.
Any suggestions?
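
In case it helps others, this is roughly how we raised the logging (just a sketch via the cluster settings API; the exact loggers and levels we tried may have differed):

    # Sketch: raise transport/discovery logging dynamically via the cluster settings API,
    # run from inside one of the pods
    curl -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
    {
      "transient": {
        "logger.org.elasticsearch.transport": "TRACE",
        "logger.org.elasticsearch.discovery": "DEBUG"
      }
    }'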

Thanks a lot

There's not really any more information available to Elasticsearch beyond the fact that the transport connection between the nodes was unexpectedly dropped. Given that it's happening every few seconds, it should be easy for your network folks to observe this directly. Pinging REST endpoints probably won't help; this problem doesn't involve the REST layer.
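
For example, the drops should show up in a plain packet capture on the transport port (a sketch, assuming tcpdump can be run somewhere that sees the pod-to-pod traffic):

    # Sketch: watch for FIN/RST packets on the Elasticsearch transport port 9300
    tcpdump -i any -nn 'tcp port 9300 and (tcp[tcpflags] & (tcp-rst|tcp-fin) != 0)'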

Okay, thanks @DavidTurner.

At the moment we're analyzing this with the AKS/K8s people.

One thing I observed is that the health endpoint responds very slowly about once a minute. See the response times:

0.008[elasticsearch@elasticsearch-master-0 ~]$ curl -XGET elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200  -w '%{time_total}' --output /dev/null --silent
0.007[elasticsearch@elasticsearch-master-0 ~]$ curl -XGET elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200  -w '%{time_total}' --output /dev/null --silent
0.008[elasticsearch@elasticsearch-master-0 ~]$ curl -XGET elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200  -w '%{time_total}' --output /dev/null --silent
5.517[elasticsearch@elasticsearch-master-0 ~]$ curl -XGET elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200  -w '%{time_total}' --output /dev/null --silent
0.006[elasticsearch@elasticsearch-master-0 ~]$ curl -XGET elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200  -w '%{time_total}' --output /dev/null --silent
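
To make this easier to hand over to the AKS team, a small loop like this could log the timings continuously (just a sketch):

    # Sketch: sample the response time of the other master's HTTP endpoint once per second
    while true; do
      t=$(curl -so /dev/null -w '%{time_total}' \
        elasticsearch-master-2.elasticsearch-master-headless.monitoring.svc.cluster.local:9200)
      echo "$(date -u +%FT%TZ) ${t}s"
      sleep 1
    done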

That seems unsurprising, but I don't expect it'll help much with your investigations.

It seems that linkerd caused the problem; after removing the sidecar everything works fine now.
If anyone has this problem, check your StatefulSet for:

    linkerd.io/inject: enabled 

:wink: ....
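
A quick way to check is something like this (a sketch, assuming the chart's default names and the monitoring namespace used above):

    # Sketch: look for the injection annotation on the StatefulSet
    kubectl -n monitoring get statefulset elasticsearch-master -o yaml | grep 'linkerd.io/inject'
    # and check whether a linkerd-proxy sidecar ended up next to the Elasticsearch container
    kubectl -n monitoring get pod elasticsearch-master-0 -o jsonpath='{.spec.containers[*].name}'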

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.