How many master nodes should be in the configuration?

Hi, I've set up the following configuration:
host1
2x master node
3x data hot
9x data warm

host2
2x master node
3x data hot
9x data warm

I don't see anything suspicious in the Docker logs, but I can't set up the kibana_system user,
and I don't get any useful information from:

curl -k http://10.244.12.241:9202/_cat/master?v -u elastic
Enter host password for user 'elastic':



{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}[elasticsearch@srv24246anm-kvm kickstart_elk_cluster]$

What's wrong?

I see messages like this in the logs:

{"@timestamp":"2022-04-25T21:34:07.340Z", "log.level": "WARN", "message":"master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [es_master_1_1, es_master_2_1, es_master_1_2, es_master_2_2] to bootstrap a cluster: have discovered [{es_master_2_1}{8nM0u-YoR9ieVudd9WT9mA}{idycjo8tTrG7pyRtQET-TQ}{10.0.0.95}{10.0.0.95:9300}{m}]; discovery will continue using [10.0.9.45:9300, 10.0.9.95:9300, 10.0.9.105:9300, 10.0.9.78:9300, 10.0.9.68:9300, 10.0.9.56:9300, 10.0.9.64:9300, 10.0.9.60:9300, 10.0.9.48:9300, 10.0.9.52:9300, 10.0.9.70:9300, 10.0.9.74:9300, 10.0.9.76:9300, 10.0.9.58:9300, 10.0.9.50:9300, 10.0.9.66:9300, 10.0.9.62:9300, 10.0.9.54:9300, 10.0.9.113:9300, 10.0.9.91:9300, 10.0.9.83:9300, 10.0.9.107:9300, 10.0.9.101:9300, 10.0.9.87:9300, 10.0.9.85:9300, 10.0.9.109:9300, 10.0.9.80:9300, 10.0.9.99:9300, 10.0.9.97:9300, 10.0.9.93:9300, 10.0.9.103:9300, 10.0.9.111:9300, 10.0.9.89:9300] from hosts providers and [{es_master_2_1}{8nM0u-YoR9ieVudd9WT9mA}{idycjo8tTrG7pyRtQET-TQ}{10.0.0.95}{10.0.0.95:9300}{m}] from last-known cluster state; node term 0, last-accepted version 0 in term 0", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_2_1][generic][T#11]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"es_master_2_1","elasticsearch.cluster.name":"elk_cluster"}
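
A warning like the one above names the master-eligible nodes required for bootstrapping and the ones discovered so far. Here is a minimal Python sketch that pulls out the missing names. It assumes the message has the shape shown in this thread; the function name `missing_masters` and the shortened sample message are illustrative, not part of Elasticsearch.

```python
import re
from typing import List


def missing_masters(message: str) -> List[str]:
    """Return the required master-eligible node names that were not discovered."""
    required_match = re.search(
        r"must discover master-eligible nodes \[([^\]]*)\]", message
    )
    discovered_match = re.search(r"have discovered \[(.*?)\];", message)
    if required_match is None or discovered_match is None:
        raise ValueError("message does not look like a cluster formation warning")

    required = [n.strip() for n in required_match.group(1).split(",") if n.strip()]
    # Each discovered node is printed as {name}{node id}{ephemeral id}{host}{address}{roles};
    # the name is the first {...} group of each comma-separated entry.
    discovered = set(re.findall(r"(?:^|, )\{([^}]*)\}", discovered_match.group(1)))
    return [name for name in required if name not in discovered]


# Shortened copy of the warning from this thread (seed address list trimmed).
SAMPLE = (
    "master not discovered yet, this node has not previously joined a bootstrapped "
    "cluster, and this node must discover master-eligible nodes "
    "[es_master_1_1, es_master_2_1, es_master_1_2, es_master_2_2] to bootstrap a "
    "cluster: have discovered [{es_master_2_1}{8nM0u-YoR9ieVudd9WT9mA}"
    "{idycjo8tTrG7pyRtQET-TQ}{10.0.0.95}{10.0.0.95:9300}{m}]; "
    "discovery will continue using [10.0.9.45:9300] from hosts providers"
)

print(missing_masters(SAMPLE))
```

For the sample, the node has only discovered es_master_2_1, so the other three names in `cluster.initial_master_nodes` are reported as missing.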

but my config looks good:

      - cluster.name=elk_cluster
      - discovery.seed_hosts=es_master_2_1,es_master_1_2,es_master_2_2,es_data_ssd_3_1_ingest,es_data_ssd_1_1,es_data_ssd_2_1,es_data_ssd_3_1,es_data_ssd_4_1,es_data_ssd_5_1,es_data_hdd_1_1,es_data_hdd_2_1,es_data_hdd_3_1,es_data_hdd_4_1,es_data_hdd_5_1,es_data_hdd_6_1,es_data_hdd_7_1,es_data_hdd_8_1,es_data_hdd_9_1,es_data_ssd_3_2_ingest,es_data_ssd_1_2,es_data_ssd_2_2,es_data_ssd_3_2,es_data_ssd_4_2,es_data_ssd_5_2,es_data_hdd_1_2,es_data_hdd_2_2,es_data_hdd_3_2,es_data_hdd_4_2,es_data_hdd_5_2,es_data_hdd_6_2,es_data_hdd_7_2,es_data_hdd_8_2,es_data_hdd_9_2
      - cluster.initial_master_nodes=es_master_1_1, es_master_2_1, es_master_1_2, es_master_2_2

To be safe you need three master-eligible nodes,
because an even number of master nodes can create a split-brain situation.
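
The arithmetic behind this advice can be sketched in a few lines of Python. This assumes a simple majority quorum over the master-eligible nodes; the function names are illustrative only.

```python
def majority(voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority is still reachable."""
    return voters - majority(voters)


for n in range(1, 7):
    print(f"{n} master-eligible nodes: quorum {majority(n)}, "
          f"tolerates {tolerated_failures(n)} failure(s)")
```

Under this model, four nodes tolerate no more failures than three, which is why odd counts are usually recommended.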

So I've increased the number of master nodes from 2 per host to 3 per host,
but I still get:

{"@timestamp":"2022-04-25T22:30:59.772Z", "log.level": "WARN", "message":"master not discovered yet, this node has not previously joined a bootstrapped cluster, and this node must discover master-eligible nodes [es_master_1_1, es_master_2_1, es_master_1_2, es_master_2_2, es_master_3_1, es_master_3_2] to bootstrap a cluster: have discovered [{es_master_1_1}{q23IasYaQse8GgbYhQ8QVA}{Ax60i-OBQEqeqXriOeGypQ}{10.0.0.206}{10.0.0.206:9300}{m}]; discovery will continue using [10.0.9.170:9300, 10.0.9.223:9300, 10.0.9.219:9300, 10.0.9.188:9300, 10.0.9.184:9300, 10.0.9.186:9300, 10.0.9.178:9300, 10.0.9.164:9300, 10.0.9.161:9300, 10.0.9.176:9300, 10.0.9.194:9300, 10.0.9.172:9300, 10.0.9.190:9300, 10.0.9.174:9300, 10.0.9.180:9300, 10.0.9.168:9300, 10.0.9.196:9300, 10.0.9.166:9300, 10.0.9.198:9300, 10.0.9.205:9300, 10.0.9.213:9300, 10.0.9.221:9300, 10.0.9.209:9300, 10.0.9.211:9300, 10.0.9.207:9300, 10.0.9.227:9300, 10.0.9.233:9300, 10.0.9.201:9300, 10.0.9.203:9300, 10.0.9.231:9300, 10.0.9.217:9300, 10.0.9.225:9300, 10.0.9.215:9300, 10.0.9.182:9300, 10.0.9.229:9300] from hosts providers and [{es_master_1_1}{q23IasYaQse8GgbYhQ8QVA}{Ax60i-OBQEqeqXriOeGypQ}{10.0.0.206}{10.0.0.206:9300}{m}] from last-known cluster state; node term 0, last-accepted version 0 in term 0", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_1_1][generic][T#8]","log.logger":"org.elasticsearch.cluster.coordination.ClusterFormationFailureHelper","elasticsearch.node.name":"es_master_1_1","elasticsearch.cluster.name":"elk_cluster"}

I'm pretty sure that no firewall is blocking the traffic.

According to iptables, the ports are already open:

Chain DOCKER-INGRESS (1 references)
target     prot opt source               destination
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9331
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9331
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9231
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9231
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9334
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9334
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9234
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9234
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9339
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9339
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9239
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9239
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9330
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9330
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9230
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9230
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9336
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9336
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9236
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9236
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9320
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9320
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9220
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            state RELATED,ESTABLISHED tcp spt:9220
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:9326

Phew, after about 1 minute (the time the discovery process took) the nodes were connected :slight_smile:

Now just make sure all the nodes are there:

curl -u user:password -XGET master_name:9200/_cat/nodes?v

curl -u elastic:password -XGET http://10.250.131.225:9202/_cat/nodes?v
{"error":{"root_cause":[{"type":"master_not_discovered_exception","reason":null}],"type":"master_not_discovered_exception","reason":null},"status":503}[elasticsearch@srv24246anm-kvm kickstart_elk_cluster]$

In the meantime I found this in the logs:

{"@timestamp":"2022-04-25T23:53:01.623Z", "log.level": "WARN", "message":"[connectToRemoteMasterNode[10.0.9.187:9300]] completed handshake with [{es_master_3_2}{utHeMP5oQGeb4VWyZHBh4g}{1LZOSe_PSI-3PGbGpWlgiA}{10.0.0.197}{10.0.0.197:9300}{m}{xpack.installed=true}] but followup connection failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#13]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[es_master_3_2][10.0.0.197:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [es_master_3_2][10.0.0.197:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1113)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}
{"@timestamp":"2022-04-25T23:53:19.762Z", "log.level": "WARN", "message":"[connectToRemoteMasterNode[10.0.9.136:9300]] completed handshake with [{es_master_1_1}{q23IasYaQse8GgbYhQ8QVA}{6vIGdALqQWmD4g6g8OUAkQ}{10.0.0.147}{10.0.0.147:9300}{m}{xpack.installed=true}] but followup connection failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#12]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[es_master_1_1][10.0.0.147:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [es_master_1_1][10.0.0.147:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1113)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}
{"@timestamp":"2022-04-25T23:53:29.836Z", "log.level": "WARN", "message":"[connectToRemoteMasterNode[10.0.9.183:9300]] completed handshake with [{es_master_2_2}{29iHv6-bQPKrbuh84GcElA}{_Pj7x2MSQCWw8tL9eYyFVw}{10.0.0.193}{10.0.0.193:9300}{m}{xpack.installed=true}] but followup connection failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#15]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[es_master_2_2][10.0.0.193:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [es_master_2_2][10.0.0.193:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1113)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}

But now the discovery result looks good:


{"@timestamp":"2022-04-25T23:58:09.185Z", "log.level": "WARN", "message":"address [10.0.9.154:9300], node [null], requesting [false] discovery result: [es_data_hdd_3_1][10.0.0.165:9300] successfully discovered master-ineligible node {es_data_hdd_3_1}{tCHKsyhdTgaHa5b_JtVuAw}{tX_Un7PkR9S9QJfq7xRlVA}{10.0.0.165}{10.0.0.165:9300}{w} at [10.0.9.154:9300]; to suppress this message, remove address [10.0.9.154:9300] from your discovery configuration or ensure that traffic to this address is routed only to master-eligible nodes", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#31]","log.logger":"org.elasticsearch.discovery.PeerFinder","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster"}
{"@timestamp":"2022-04-25T23:58:09.187Z", "log.level": "WARN", "message":"address [10.0.9.158:9300], node [null], requesting [false] discovery result: [es_data_ssd_1_1][10.0.0.169:9300] successfully discovered master-ineligible node {es_data_ssd_1_1}{U-tbIYFYSX63MZFdleIi4A}{xb9PROnHSau9PaVsGeWaWg}{10.0.0.169}{10.0.0.169:9300}{hs} at [10.0.9.158:9300]; to suppress this message, remove address [10.0.9.158:9300] from your discovery configuration or ensure that traffic to this address is routed only to master-eligible nodes", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#8]","log.logger":"org.elasticsearch.discovery.PeerFinder","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster"}
{"@timestamp":"2022-04-25T23:58:09.188Z", "log.level": "WARN", "message":"address [10.0.9.144:9300], node [null], requesting [false] discovery result: [es_data_ssd_3_1][10.0.0.155:9300] successfully discovered master-ineligible node {es_data_ssd_3_1}{eqPk7f-vTGGHkxrAxJmAFQ}{-1Ix2kadTGa0bQBmCLSDSw}{10.0.0.155}{10.0.0.155:9300}{hs} at [10.0.9.144:9300]; to suppress this message, remove address [10.0.9.144:9300] from your discovery configuration or ensure that traffic to this address is routed only to master-eligible nodes", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_3_1][generic][T#23]","log.logger":"org.elasticsearch.discovery.PeerFinder","elasticsearch.node.name":"es_master_3_1","elasticsearch.cluster.name":"elk_cluster"}
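
These warnings say that `discovery.seed_hosts` contains addresses of master-ineligible (data) nodes, and suggest removing them from the discovery configuration. Here is a small Python sketch of trimming the list. It assumes that every master-eligible node in this setup has a name starting with `es_master_`; the helper `master_only` is hypothetical and the seed list is a shortened copy of the one posted above.

```python
from typing import List


def master_only(seed_hosts: str, prefix: str = "es_master_") -> List[str]:
    """Keep only the seed hosts whose name marks them as master-eligible."""
    names = [host.strip() for host in seed_hosts.split(",") if host.strip()]
    return [host for host in names if host.startswith(prefix)]


# Shortened version of the discovery.seed_hosts value from this thread.
SEED_HOSTS = (
    "es_master_2_1,es_master_1_2,es_master_2_2,"
    "es_data_ssd_3_1_ingest,es_data_ssd_1_1,es_data_hdd_1_1"
)

print("discovery.seed_hosts=" + ",".join(master_only(SEED_HOSTS)))
```

The printed line can be pasted back into the compose file in place of the long list; data nodes still find the cluster through the master-eligible seeds.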

So something is still wrong ...

Suggestions for advanced troubleshooting would be appreciated.

curl -u elastic:o8toW7nf64DAy3 -XGET http://10.250.131.225:9202
{
  "name" : "es_master_1_1",
  "cluster_name" : "elk_cluster",
  "cluster_uuid" : "na",
  "version" : {
    "number" : "8.1.0",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "3700f7679f7d95e36da0b43762189bab189bc53a",
    "build_date" : "2022-03-03T14:20:00.690422633Z",
    "build_snapshot" : false,
    "lucene_version" : "9.0.0",
    "minimum_wire_compatibility_version" : "7.17.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "You Know, for Search"
}

What more can I explain?
The address 10.0.0.5:9300 comes from the Docker Swarm ingress network, so I think this is where the problem is.

But the address in "[10.0.9.2:9300]] completed handshake with" comes from the mynet network, and that one works.

So it looks like the Docker Swarm network is the issue.


{"@timestamp":"2022-04-26T00:40:01.104Z", "log.level": "WARN", "message":"[connectToRemoteMasterNode[10.0.9.2:9300]] completed handshake with [{es_master_1_1}{q23IasYaQse8GgbYhQ8QVA}{5tpB5uS9R12MOqrBqlXIgw}{10.0.0.5}{10.0.0.5:9300}{m}{xpack.installed=true}] but followup connection failed", "ecs.version": "1.2.0","service.name":"ES_ECS","event.dataset":"elasticsearch.server","process.thread.name":"elasticsearch[es_master_2_1][generic][T#23]","log.logger":"org.elasticsearch.discovery.HandshakingTransportAddressConnector","elasticsearch.node.name":"es_master_2_1","elasticsearch.cluster.name":"elk_cluster","error.type":"org.elasticsearch.transport.ConnectTransportException","error.message":"[es_master_1_1][10.0.0.5:9300] connect_timeout[30s]","error.stack_trace":"org.elasticsearch.transport.ConnectTransportException: [es_master_1_1][10.0.0.5:9300] connect_timeout[30s]\n\tat org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:1113)\n\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:717)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base/java.lang.Thread.run(Thread.java:833)\n"}
{"@timestamp":"2022-04-26T00:40:07.127Z", "log.level": "WARN", "message":"[connectToRemoteMasterNode[10.0.9.48:9300]]
elasticsearch@bde3eeaf65bc:~$ curl -v http://10.0.9.60:9300
*   Trying 10.0.9.60:9300...
* TCP_NODELAY set
* Connected to 10.0.9.60 (10.0.9.60) port 9300 (#0)
> GET / HTTP/1.1
> Host: 10.0.9.60:9300
> User-Agent: curl/7.68.0
> Accept: */*
>
* Empty reply from server
* Connection #0 to host 10.0.9.60 left intact
curl: (52) Empty reply from server
elasticsearch@bde3eeaf65bc:~$ curl -v http://10.0.0.61:9300
*   Trying 10.0.0.61:9300...
* TCP_NODELAY set
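
Port 9300 is the Elasticsearch transport port and does not speak HTTP, so the "Empty reply from server" above only shows that the TCP connection itself succeeded. A plain TCP check makes the comparison between the two networks more direct. This is a minimal Python sketch; the function name `can_connect` and the 3-second timeout are arbitrary choices.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False
```

Run from inside a container, `can_connect("10.0.9.60", 9300)` and `can_connect("10.0.0.61", 9300)` would show whether the mynet address and the ingress address are both reachable.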

If you are looking for high availability you will need three hosts. Having 3 master nodes per host instead of 2 will not help.
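
To illustrate why the number of hosts matters more than the number of masters per host, here is a simplified Python sketch. It assumes a static majority quorum over all master-eligible nodes and ignores Elasticsearch's automatic voting configuration adjustments; the function and host names are illustrative.

```python
from typing import Dict


def survives_host_loss(masters_per_host: Dict[str, int], lost_host: str) -> bool:
    """Check whether the remaining masters still form a majority of all masters."""
    total = sum(masters_per_host.values())
    quorum = total // 2 + 1
    remaining = total - masters_per_host[lost_host]
    return remaining >= quorum


two_hosts = {"host1": 3, "host2": 3}
three_hosts = {"host1": 1, "host2": 1, "host3": 1}

print("2 hosts, 3 masters each:", survives_host_loss(two_hosts, "host1"))
print("3 hosts, 1 master each:", survives_host_loss(three_hosts, "host1"))
```

With two hosts, one of them always holds at least half of the masters, so losing it leaves the cluster without a majority no matter how many masters run per host.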

I found the solution. If you have a problem with the ingress network,
here is a short workaround:

1. `docker swarm init --advertise-addr 10.244.12.241`

2. `docker network rm ingress`

3. `docker network create --driver overlay --ingress --subnet=10.255.0.0/16 --gateway=10.255.0.1 my-ingress`

4. `systemctl restart docker`

This was tested on Docker version 20.10.14, build a224086, and Elasticsearch 8.1.3.
