I have a 5-node cluster: 3 master nodes on RHEL VMs and 2 data nodes on bare-metal RHEL. The master nodes connect to each other without problems and report a cluster status of green. The data nodes are able to connect to a master, but then fail with the error:
timed out while waiting for initial discovery state - timeout: 30s
Here is the config for the master nodes:
cluster.name: ${ES_CLUSTER_NAME}
#
node.name: ${HOSTNAME}
node.master: true
node.data: false
#
path.data: ${ES_DATA_PATH}
#
# Path to log files:
#
path.logs: ${ES_LOG_PATH}
#
# Path to backup directory
path.repo: ${ES_PATH_REPO}
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
bootstrap.memory_lock: ${ES_DISABLE_MEMORY_SWAPPING}
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: "_site_ , _local_"
#
# Set a custom port for HTTP:
#
http.port: ${ES_HTTP_PORT}
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: "hostname1, hostname2, hostname3, hostname4, hostname5"
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 2
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 5
The data nodes use the same config, except that node.master and node.data are flipped (node.master: false, node.data: true).
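For reference, the quorum formula in the config comment above (master-eligible nodes / 2 + 1, using integer division) can be checked with a quick sketch. The function name is made up for illustration. Only the 3 master-eligible nodes count here, not the 2 data nodes, so the result should match the configured discovery.zen.minimum_master_nodes value of 2.

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


# This cluster has 3 master-eligible nodes (the data nodes do not count).
print(minimum_master_nodes(3))  # prints 2
```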
Logs from the data node side:
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,114][DEBUG][o.e.t.n.Netty4Transport ] [hostname.example.com] using profile[default], worker_count[64], port[9300-9400], bind_host[[_site_, _local_]], publish_host[[]], compress[false], connect_timeout[30s], connections_per_node[2/3/6/0/1], receive_predictor[64kb->64kb]
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,122][DEBUG][o.e.t.n.Netty4Transport ] [hostname.example.com] binding server bootstrap to: [::1, 127.0.0.1, 123456, 1234456, 1234456]
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,278][DEBUG][o.e.t.n.Netty4Transport ] [hostname.example.com] Bound profile [default] to address {[::1]:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,279][DEBUG][o.e.t.n.Netty4Transport ] [hostname.example.com] Bound profile [default] to address {127.0.0.1:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,280][DEBUG][o.e.t.n.Netty4Transport ] [hostname.example.com] Bound profile [default] to address {123456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,280][DEBUG][o.e.t.n.Netty4Transport ] [hostname.example.com] Bound profile [default] to address {1234456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,281][DEBUG][o.e.t.n.Netty4Transport ] [hostname.example.com] Bound profile [default] to address {1234456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,284][INFO ][o.e.t.TransportService ] [hostname.example.com] publish_address {10.25.41.12:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}, {1234456:9300}, {1234456:9300}, {1234456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,292][INFO ][o.e.b.BootstrapChecks ] [hostname.example.com] bound or publishing to a non-loopback address, enforcing bootstrap checks
Aug 01 15:20:06 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:06,684][DEBUG][o.e.t.n.Netty4Transport ] [hostname.example.com] connected to node [{MASTERNODE_HOSTNAME}{-WYkalJ7SqqTq80b42azrQ}{Y6f_c3V0RNCoeUtQiOaQIg}{xxxxx}{xxxxx:9300}]
Aug 01 15:20:33 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:33,315][WARN ][o.e.n.Node ] [hostname.example.com] timed out while waiting for initial discovery state - timeout: 30s
Aug 01 15:20:33 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:33,335][INFO ][o.e.h.n.Netty4HttpServerTransport] [hostname.example.com] publish_address {xxxxxx:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}, {xxxxxxx:9200}, {xxxxxxx:9200}, {xxxxxxx:9200}
Aug 01 15:20:33 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:33,335][INFO ][o.e.n.Node ] [hostname.example.com] started
Aug 01 15:20:36 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:36,710][INFO ][o.e.d.z.ZenDiscovery ] [hostname.example.com] failed to send join request to master [{MASTERNODE_HOSTNAME}{-WYkalJ7SqqTq80b42azrQ}{Y6f_c3V0RNCoeUtQiOaQIg}{xxxxxxx}{xxxxxx:9300}], reason [RemoteTransportException[[MASTERNODE_HOSTNAME][xxxxxx:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[hostname.example.com][xxxxxx:9300] connect_timeout[30s]]; ]
The master node shows no relevant log entries. I turned on the tracer there and can see multiple requests being received from the data nodes and responses being sent back to them.
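The join failure in the data node log wraps a ConnectTransportException with connect_timeout[30s] that names the data node's own address, which suggests the master may not have been able to open a transport connection back to the data node. A minimal sketch for testing that direction from a master host, assuming Python 3 is available there: the function name and the DATA_NODE_HOST placeholder are hypothetical, and 9300 is the transport port shown in the logs above.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example, run on a master node (DATA_NODE_HOST is a placeholder for the
# address the data node reports as its publish_address):
#   can_connect("DATA_NODE_HOST", 9300)
```

If this returns False from the masters while the same check from the data node toward a master returns True, a firewall rule, a routing problem, or a publish address that the masters cannot reach would be worth ruling out.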