Ok, interesting. I still think we aren't seeing enough logs. The last master-node change is here, in which d2c-es-cluster-prod-master-1 is elected:
{"type": "server", "timestamp": "2019-10-16T20:09:09,534Z", "level": "INFO", "component": "o.e.c.s.MasterService", "cluster.name": "d2c-es-cluster-prod", "node.name": "d2c-es-cluster-prod-master-1", "message": "elected-as-master ([2] nodes joined)[{d2c-es-cluster-prod-master-0}{toxb8i8QTQqxV85UGFwAOA}{3eN9f7vVQnq4uirgonH3cQ}{10.253.142.32}{10.253.142.32:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} elect leader, {d2c-es-cluster-prod-master-1}{ZNJT576sQ5WcGAE8xa78Yw}{HBF9jRqiQxOLEmwLn7DlBw}{10.253.142.161}{10.253.142.161:9300}{dilm}{ml.machine_memory=21474836480, xpack.installed=true, ml.max_open_jobs=20} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 2, version: 60, reason: master node changed {previous [], current [{d2c-es-cluster-prod-master-1}{ZNJT576sQ5WcGAE8xa78Yw}{HBF9jRqiQxOLEmwLn7DlBw}{10.253.142.161}{10.253.142.161:9300}{dilm}{ml.machine_memory=21474836480, xpack.installed=true, ml.max_open_jobs=20}]}", "cluster.uuid": "5xgGnZjvRgGAvMAeqwI-WQ", "node.id": "ZNJT576sQ5WcGAE8xa78Yw" }
However, I don't see any messages indicating that this election either completed or timed out, and the logs end just a couple of minutes later. Can you collect logs for a longer period? Ideally I'd like to see either the election succeed and the cluster quieten down again, or three or four elected-as-master messages from retried elections.
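Once a longer capture is available, a quick filter along these lines can surface the election-related messages; this is just a sketch, and the marker strings are assumptions based on the messages quoted in this thread:

```python
import json

# Substrings that identify election-related log messages; these markers
# are guesses taken from the log lines quoted above, not a stable contract.
ELECTION_MARKERS = ("elected-as-master", "master node changed")

def election_events(lines):
    """Yield (timestamp, node.name, message) for election-related log lines."""
    for line in lines:
        try:
            entry = json.loads(line)
        except ValueError:
            continue  # skip anything that isn't a JSON log line
        message = entry.get("message", "")
        if any(marker in message for marker in ELECTION_MARKERS):
            yield entry.get("timestamp"), entry.get("node.name"), message
```

Running the full log files through this should make it obvious whether the election was retried repeatedly or happened just once.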
There is some suggestion that cluster state publication is happening very slowly:
{"type": "server", "timestamp": "2019-10-16T20:06:19,442Z", "level": "INFO", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "d2c-es-cluster-prod", "node.name": "d2c-es-cluster-prod-master-1", "message": "after [10.1s] publication of cluster state version [59] is still waiting for {d2c-es-cluster-prod-master-0}{toxb8i8QTQqxV85UGFwAOA}{3eN9f7vVQnq4uirgonH3cQ}{10.253.142.32}{10.253.142.32:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {d2c-es-cluster-prod-master-2}{hGtaxylJToeRvQ6fKQmfkg}{8PIeKHJzRW-Qrxr3Aohcww}{10.253.142.77}{10.253.142.77:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {d2c-es-cluster-prod-master-1}{ZNJT576sQ5WcGAE8xa78Yw}{HBF9jRqiQxOLEmwLn7DlBw}{10.253.142.161}{10.253.142.161:9300}{dilm}{ml.machine_memory=21474836480, xpack.installed=true, ml.max_open_jobs=20} [WAITING_FOR_QUORUM]", "cluster.uuid": "5xgGnZjvRgGAvMAeqwI-WQ", "node.id": "ZNJT576sQ5WcGAE8xa78Yw" }
However, this isn't obviously a network issue, because the next publication, which happens just after an election, shows the same symptom:
{"type": "server", "timestamp": "2019-10-16T20:09:19,536Z", "level": "INFO", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "d2c-es-cluster-prod", "node.name": "d2c-es-cluster-prod-master-1", "message": "after [10s] publication of cluster state version [60] is still waiting for {d2c-es-cluster-prod-master-0}{toxb8i8QTQqxV85UGFwAOA}{3eN9f7vVQnq4uirgonH3cQ}{10.253.142.32}{10.253.142.32:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {d2c-es-cluster-prod-master-2}{hGtaxylJToeRvQ6fKQmfkg}{8PIeKHJzRW-Qrxr3Aohcww}{10.253.142.77}{10.253.142.77:9300}{dilm}{ml.machine_memory=21474836480, ml.max_open_jobs=20, xpack.installed=true} [SENT_PUBLISH_REQUEST], {d2c-es-cluster-prod-master-1}{ZNJT576sQ5WcGAE8xa78Yw}{HBF9jRqiQxOLEmwLn7DlBw}{10.253.142.161}{10.253.142.161:9300}{dilm}{ml.machine_memory=21474836480, xpack.installed=true, ml.max_open_jobs=20} [WAITING_FOR_QUORUM]", "cluster.uuid": "5xgGnZjvRgGAvMAeqwI-WQ", "node.id": "ZNJT576sQ5WcGAE8xa78Yw" }