How many master nodes should be in the configuration?

Hi, I've set up the following configuration:
host1
2x master node
3x data hot
9x data warm

host2
2x master node
3x data hot
9x data warm

I don't see anything suspicious in the Docker logs, but I can't set up the kibana_system user,
and I don't get any information from:

curl -k http://10.244.12.241:9202/_cat/master?v -u elastic
Enter host password for user 'elastic':



{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

What's wrong?

I see messages like this in the logs:

{"@timestamp":"2022-04-25T21:34:07.340Z", "log.level": "WARN", "message":"master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [es_master_1_1, es_master_2_1, es_master_1_2, es_master_2_2] to bootstrap a cluster: have discovered [{es_master_2_1}{8nM0u-YoR9ieVudd9WT9mA}{idycjo8tTrG7pyRtQET-TQ}{10.0.0.95}{10.0.0.95:9300}{m}]; discovery will continue using [10.0.9.45:9300, 10.0.9.95:9300, 10.0.9.105:9300, 10.0.9.78:9300, 10.0.9.68:9300, 10.0.9.56:9300, 10.0.9.64:9300, 10.0.9.60:9300, 10.0.9.48:9300, 10.0.9.52:9300, 10.0.9.70:9300, 10.0.9.74:9300, 10.0.9.76:9300, 10.0.9.58:9300, 10.0.9.50:9300, 10.0.9.66:9300, 10.0.9.62:9300, 10.0.9.54:9300, 10.0.9.113:9300, 10.0.9.91:9300, 10.0.9.83:9300, 10.0.9.107:9300, 10.0.9.101:9300, 10.0.9.87:9300, 10.0.9.85:9300, 10.0.9.109:9300, 10.0.9.80:9300, 10.0.9.99:9300, 10.0.9.97:9300, 10.0.9.93:9300, 10.0.9.103:9300, 10.0.9.111:9300, 10.0.9.89:9300] from hosts providers and [{es_master_2_1}{8nM0u-YoR9ieVudd9WT9mA}{idycjo8tTrG7pyRtQET-TQ}{10.0.0.95}{10.0.0.95:9300}{m}] from last-known cluster state; node term 0, last-accepted version 0 in term 0", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_2_1][generic][T#11]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"es_master_2_1","elasticsearch.cluster.name":"elk_cluster"}

but my config looks good:

      - cluster.name=elk_cluster
      - discovery.seed_hosts=es_master_2_1,es_master_1_2,es_master_2_2,es_data_ssd_3_1_ingest,es_data_ssd_1_1,es_data_ssd_2_1,es_data_ssd_3_1,es_data_ssd_4_1,es_data_ssd_5_1,es_data_hdd_1_1,es_data_hdd_2_1,es_data_hdd_3_1,es_data_hdd_4_1,es_data_hdd_5_1,es_data_hdd_6_1,es_data_hdd_7_1,es_data_hdd_8_1,es_data_hdd_9_1,es_data_ssd_3_2_ingest,es_data_ssd_1_2,es_data_ssd_2_2,es_data_ssd_3_2,es_data_ssd_4_2,es_data_ssd_5_2,es_data_hdd_1_2,es_data_hdd_2_2,es_data_hdd_3_2,es_data_hdd_4_2,es_data_hdd_5_2,es_data_hdd_6_2,es_data_hdd_7_2,es_data_hdd_8_2,es_data_hdd_9_2
      - cluster.initial_master_nodes=es_master_1_1, es_master_2_1, es_master_1_2, es_master_2_2

To be safe you need three master nodes,
because an even number of master nodes can create a split-brain situation.

So I've increased the number of master nodes from 2 per host to 3 per host,
but I still get:

{"@timestamp":"2022-04-25T22:30:59.772Z", "log.level": "WARN", "message":"master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [es_master_1_1, es_master_2_1, es_master_1_2, es_master_2_2, es_master_3_1, es_master_3_2] to bootstrap a cluster: have discovered [{es_master_1_1}{q23IasYaQse8GgbYhQ8QVA}{Ax60i-OBQEqeqXriOeGypQ}{10.0.0.206}{10.0.0.206:9300}{m}]; discovery will continue using [10.0.9.170:9300, 10.0.9.223:9300, 10.0.9.219:9300, 10.0.9.188:9300, 10.0.9.184:9300, 10.0.9.186:9300, 10.0.9.178:9300, 10.0.9.164:9300, 10.0.9.161:9300, 10.0.9.176:9300, 10.0.9.194:9300, 10.0.9.172:9300, 10.0.9.190:9300, 10.0.9.174:9300, 10.0.9.180:9300, 10.0.9.168:9300, 10.0.9.196:9300, 10.0.9.166:9300, 10.0.9.198:9300, 10.0.9.205:9300, 10.0.9.213:9300, 10.0.9.221:9300, 10.0.9.209:9300, 10.0.9.211:9300, 10.0.9.207:9300, 10.0.9.227:9300, 10.0.9.233:9300, 10.0.9.201:9300, 10.0.9.203:9300, 10.0.9.231:9300, 10.0.9.217:9300, 10.0.9.225:9300, 10.0.9.215:9300, 10.0.9.182:9300, 10.0.9.229:9300] from hosts providers and [{es_master_1_1}{q23IasYaQse8GgbYhQ8QVA}{Ax60i-OBQEqeqXriOeGypQ}{10.0.0.206}{10.0.0.206:9300}{m}] from last-known cluster state; node term 0, last-accepted version 0 in term 0", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_1_1][generic][T#8]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"es_master_1_1","elasticsearch.cluster.name":"elk_cluster"}

I'm pretty sure that no firewall is blocking the traffic.

According to iptables, the ports are already open:

Chain DOCKER-INGRESS (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9331
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9331
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9231
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9231
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9334
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9334
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9234
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9234
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9339
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9339
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9239
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9239
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9330
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9330
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9230
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9230
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9336
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9336
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9236
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9236
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9320
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9320
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9220
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9220
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9326

Phew, after about 1 minute (the time the discovery process took) the nodes were connected :)

Now just make sure all the nodes are there:

curl -u user:password -XGET master_name:9200/_cat/nodes?v

curl -u elastic:password -XGET http://10.250.131.225:9202/_cat/nodes?v
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}

In the meantime I found this:

{"@timestamp":"2022-04-25T23:53:01.623Z", "log.level": "WARN", "message":"[connectToRemoteMasterNode[10.0.9.187:9300]] completed handshake with [{es_master_3_2}{utHeMP5oQGeb4VWyZHBh4g}{1LZOSe_PSI-3PGbGpWlgiA}{10.0.0.197}{10.0.0.197:9300}{m}{xpack.installed=true}] but followup connection failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#13]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[es_master_3_2][10.0.0.197:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [es_master_3_2][10.0.0.197:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1113)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}
{"@timestamp":"2022-04-25T23:53:19.762Z", "log.level": "WARN", "message":"[connectToRemoteMasterNode[10.0.9.136:9300]] completed handshake with [{es_master_1_1}{q23IasYaQse8GgbYhQ8QVA}{6vIGdALqQWmD4g6g8OUAkQ}{10.0.0.147}{10.0.0.147:9300}{m}{xpack.installed=true}] but followup connection failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#12]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[es_master_1_1][10.0.0.147:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [es_master_1_1][10.0.0.147:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1113)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}
{"@timestamp":"2022-04-25T23:53:29.836Z", "log.level": "WARN", "message":"[connectToRemoteMasterNode[10.0.9.183:9300]] completed handshake with [{es_master_2_2}{29iHv6-bQPKrbuh84GcElA}{_Pj7x2MSQCWw8tL9eYyFVw}{10.0.0.193}{10.0.0.193:9300}{m}{xpack.installed=true}] but followup connection failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#15]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[es_master_2_2][10.0.0.193:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [es_master_2_2][10.0.0.193:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1113)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}

but now the discovery result looks good:


{"@timestamp":"2022-04-25T23:58:09.185Z", "log.level": "WARN", "message":"address [10.0.9.154:9300], node [null], requesting [false] discovery result: [es_data_hdd_3_1][10.0.0.165:9300] successfully discovered master-ineligible node {es_data_hdd_3_1}{tCHKsyhdTgaHa5b_JtVuAw}{tX_Un7PkR9S9QJfq7xRlVA}{10.0.0.165}{10.0.0.165:9300}{w} at [10.0.9.154:9300]; to suppress this message, remove address [10.0.9.154:9300] from your discovery configuration or ensure that traffic to this address is routed only to master-eligible nodes", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#31]","log.logger":"org.elasticsearch.discovery.PeerFinder","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster"}
{"@timestamp":"2022-04-25T23:58:09.187Z", "log.level": "WARN", "message":"address [10.0.9.158:9300], node [null], requesting [false] discovery result: [es_data_ssd_1_1][10.0.0.169:9300] successfully discovered master-ineligible node {es_data_ssd_1_1}{U-tbIYFYSX63MZFdleIi4A}{xb9PROnHSau9PaVsGeWaWg}{10.0.0.169}{10.0.0.169:9300}{hs} at [10.0.9.158:9300]; to suppress this message, remove address [10.0.9.158:9300] from your discovery configuration or ensure that traffic to this address is routed only to master-eligible nodes", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#8]","log.logger":"org.elasticsearch.discovery.PeerFinder","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster"}
{"@timestamp":"2022-04-25T23:58:09.188Z", "log.level": "WARN", "message":"address [10.0.9.144:9300], node [null], requesting [false] discovery result: [es_data_ssd_3_1][10.0.0.155:9300] successfully discovered master-ineligible node {es_data_ssd_3_1}{eqPk7f-vTGGHkxrAxJmAFQ}{-1Ix2kadTGa0bQBmCLSDSw}{10.0.0.155}{10.0.0.155:9300}{hs} at [10.0.9.144:9300]; to suppress this message, remove address [10.0.9.144:9300] from your discovery configuration or ensure that traffic to this address is routed only to master-eligible nodes", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#23]","log.logger":"org.elasticsearch.discovery.PeerFinder","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster"}
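
The PeerFinder warnings above ask for master-ineligible addresses to be removed from the discovery configuration. Here is a minimal sketch of trimming the list, assuming the master-eligible containers are exactly those whose names start with es_master_ (the naming used in this thread). The list is shortened for illustration and masters_only is a hypothetical helper. This only addresses the warnings, not necessarily the connection timeouts.

```shell
#!/usr/bin/env bash
# Shortened copy of the discovery.seed_hosts value from this thread (illustration only).
seed_hosts='es_master_2_1,es_master_1_2,es_master_2_2,es_data_ssd_1_1,es_data_hdd_1_1,es_data_ssd_1_2,es_data_hdd_1_2'

# Hypothetical helper: keep only entries that start with es_master_
# (assumed naming convention for master-eligible nodes).
masters_only() {
  printf '%s\n' "$1" | tr ',' '\n' | grep '^es_master_' | paste -sd, -
}

# Print the trimmed setting in the same key=value form as the compose file.
echo "discovery.seed_hosts=$(masters_only "$seed_hosts")"
```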

So something is still missing...

Suggestions for advanced troubleshooting are appreciated.

curl -u elastic:password -XGET http://10.250.131.225:9202
{
  "name" : "es_master_1_1",
  "cluster_name" : "elk_cluster",
  "cluster_uuid" : "na",
  "version" : {
    "number" : "8.1.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "3700f7679f7d95e36da0b43762189bab189bc53a",
    "build_date" : "2022-03-03T14:20:00.690422633Z",
    "build_snapshot" : false,
    "lucene_version" : "9.0.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

What more can I add:
the address 10.0.0.5:9300 comes from the Docker Swarm ingress network, so I think that is where the problem lies,

whereas "[10.0.9.2:9300]] completed handshake with" comes from the mynet network, and that part works.

So the Docker Swarm network looks like the culprit:


{"@timestamp":"2022-04-26T00:40:01.104Z", "log.level": "WARN", "message":"[connectToRemoteMasterNode[10.0.9.2:9300]] completed handshake with [{es_master_1_1}{q23IasYaQse8GgbYhQ8QVA}{5tpB5uS9R12MOqrBqlXIgw}{10.0.0.5}{10.0.0.5:9300}{m}{xpack.installed=true}] but followup connection failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_2_1][generic][T#23]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"es_master_2_1","elasticsearch.cluster.name":"elk_cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[es_master_1_1][10.0.0.5:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [es_master_1_1][10.0.0.5:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1113)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}
{"@timestamp":"2022-04-26T00:40:07.127Z", "log.level": "WARN", "message":"[connectToRemoteMasterNode[10.0.9.48:9300]]
elasticsearch@bde3eeaf65bc:~$ curl -v http://10.0.9.60:9300
*   Trying 10.0.9.60:9300...
* TCP_NODELAY set
* Connected to 10.0.9.60 (10.0.9.60) port 9300 (#0)
> GET / HTTP/1.1
> Host: 10.0.9.60:9300
> User-Agent: curl/7.68.0
> Accept: */*
>
* Empty reply from server
* Connection #0 to host 10.0.9.60 left intact
curl: (52) Empty reply from server
elasticsearch@bde3eeaf65bc:~$ curl -v http://10.0.0.61:9300
*   Trying 10.0.0.61:9300...
* TCP_NODELAY set
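
Each "followup connection failed" warning above contains two addresses: the one the node dialled (inside connectToRemoteMasterNode[...]) and the one the remote node published about itself (the {ip:port} group). Here is a rough sketch for pulling both out of a message so the mismatch is easy to spot. The sed patterns assume the message layout seen in this thread, and the helper names are made up for illustration.

```shell
#!/usr/bin/env bash
# Message text shortened from the es_master_3_1 log posted above.
line='[connectToRemoteMasterNode[10.0.9.187:9300]] completed handshake with [{es_master_3_2}{utHeMP5oQGeb4VWyZHBh4g}{1LZOSe_PSI-3PGbGpWlgiA}{10.0.0.197}{10.0.0.197:9300}{m}{xpack.installed=true}] but followup connection failed'

# Address that was dialled during discovery.
probe_addr() {
  printf '%s\n' "$1" | sed -n 's/.*connectToRemoteMasterNode\[\([0-9.]*:[0-9]*\)\].*/\1/p'
}

# Address the remote node published (the {ip:port} group in the node description).
publish_addr() {
  printf '%s\n' "$1" | sed -n 's/.*{\([0-9.]*:[0-9]*\)}.*/\1/p'
}

# First three octets of an ip:port string, as a quick way to compare networks.
net_prefix() {
  printf '%s\n' "${1%.*}"
}

p=$(probe_addr "$line")
q=$(publish_addr "$line")
echo "dialled   $p (network $(net_prefix "$p").x)"
echo "published $q (network $(net_prefix "$q").x)"
```

If the two prefixes differ, as they do here (10.0.9.x for mynet, 10.0.0.x for ingress), the handshake and the follow-up connection are going over different networks.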

If you are looking for high availability you will need three hosts. Having 3 master nodes per host instead of 2 will not help.
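
To illustrate the point about hosts, here is a small sketch of the majority arithmetic. It uses a simplified model that counts every master-eligible node as a voter and assumes the masters are spread evenly across hosts. Elasticsearch's real voting configuration is more nuanced, so treat this as a rough check only. The function names are made up for this example.

```shell
#!/usr/bin/env bash
# Votes needed for a majority of n master-eligible nodes.
quorum() {
  echo $(( $1 / 2 + 1 ))
}

# can_elect HOSTS PER_HOST: prints "yes" if the masters that remain after
# losing one whole host still form a majority of the original set, else "no".
can_elect() (
  total=$(( $1 * $2 ))
  left=$(( total - $2 ))
  if [ "$left" -ge "$(quorum "$total")" ]; then echo yes; else echo no; fi
)

echo "2 hosts x 2 masters survives a host loss: $(can_elect 2 2)"
echo "2 hosts x 3 masters survives a host loss: $(can_elect 2 3)"
echo "3 hosts x 1 master  survives a host loss: $(can_elect 3 1)"
```

Under this model, adding masters on the same two hosts never helps, because losing a host always removes half of the voters.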

I found the solution. If you have a problem with the ingress network,
this is a short workaround:

1. `docker swarm init --advertise-addr 10.244.12.241`

2. `docker network rm ingress`

3. `docker network create --driver overlay --ingress --subnet=10.255.0.0/16 --gateway=10.255.0.1 my-ingress`

4. `systemctl restart docker`

This was tested on Docker version 20.10.14 (build a224086) and Elasticsearch 8.1.3.