Unable to connect data nodes on bare metal to master nodes on VMs

I have a 5-node cluster: 3 masters on RHEL VMs and 2 data nodes on RHEL bare metal. The master nodes connect to each other fine and show a cluster status of green. The data nodes are able to connect but then fail with the error:
timed out while waiting for initial discovery state - timeout: 30s

Here is the master node config:


cluster.name: ${ES_CLUSTER_NAME}
#
node.name: ${HOSTNAME}
node.master: true
node.data: false
#
path.data: ${ES_DATA_PATH}
#
# Path to log files:
#
path.logs: ${ES_LOG_PATH}
#
# Path to backup directory
path.repo: ${ES_PATH_REPO}
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
bootstrap.memory_lock: ${ES_DISABLE_MEMORY_SWAPPING}
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: "_site_ , _local_"
#
# Set a custom port for HTTP:
#
http.port: ${ES_HTTP_PORT}
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: "hostname1, hostname2, hostname3, hostname4, hostname5"
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
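# With three master-eligible nodes in this cluster: 3 / 2 + 1 = 2 (integer division).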
discovery.zen.minimum_master_nodes: 2
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 5

The data nodes have the same config, except node.master and node.data are flipped.
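A minimal sketch of that difference on the data-node side (everything else identical to the master config above):

node.name: ${HOSTNAME}
node.master: false
node.data: true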

Logs from the data node side:

Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,114][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] using profile[default], worker_count[64], port[9300-9400], bind_host[[_site_, _local_]], publish_host[[]], compress[false], connect_timeout[30s], connections_per_node[2/3/6/0/1], receive_predictor[64kb->64kb]
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,122][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] binding server bootstrap to: [::1, 127.0.0.1, 123456, 1234456, 1234456]
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,278][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] Bound profile [default] to address {[::1]:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,279][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] Bound profile [default] to address {127.0.0.1:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,280][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] Bound profile [default] to address {123456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,280][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] Bound profile [default] to address {1234456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,281][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] Bound profile [default] to address {1234456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,284][INFO ][o.e.t.TransportService   ] [hostname.example.com] publish_address {10.25.41.12:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}, {1234456:9300}, {1234456:9300}, {1234456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,292][INFO ][o.e.b.BootstrapChecks    ] [hostname.example.com] bound or publishing to a non-loopback address, enforcing bootstrap checks
Aug 01 15:20:06 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:06,684][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] connected to node [{MASTERNODE_HOSTNAME}{-WYkalJ7SqqTq80b42azrQ}{Y6f_c3V0RNCoeUtQiOaQIg}{xxxxx}{xxxxx:9300}]
Aug 01 15:20:33 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:33,315][WARN ][o.e.n.Node               ] [hostname.example.com] timed out while waiting for initial discovery state - timeout: 30s
Aug 01 15:20:33 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:33,335][INFO ][o.e.h.n.Netty4HttpServerTransport] [hostname.example.com] publish_address {xxxxxx:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}, {xxxxxxx:9200}, {xxxxxxx:9200}, {xxxxxxx:9200}
Aug 01 15:20:33 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:33,335][INFO ][o.e.n.Node               ] [hostname.example.com] started
Aug 01 15:20:36 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:36,710][INFO ][o.e.d.z.ZenDiscovery     ] [hostname.example.com] failed to send join request to master [{MASTERNODE_HOSTNAME}{-WYkalJ7SqqTq80b42azrQ}{Y6f_c3V0RNCoeUtQiOaQIg}{xxxxxxx}{xxxxxx:9300}], reason [RemoteTransportException[[MASTERNODE_HOSTNAME][xxxxxx:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[hostname.example.com][xxxxxx:9300] connect_timeout[30s]]; ]

The master node shows no relevant logs. I turned on the transport tracer and can see multiple requests received from, and responses sent to, the data nodes.
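For reference, I enabled the tracer by setting the transport tracer logger to TRACE (a sketch, per the Elasticsearch logging docs; it can go in elasticsearch.yml or be applied dynamically via the cluster settings API):

# Log a trace line for every transport request/response on this node
logger.org.elasticsearch.transport.TransportService.tracer: TRACE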

Can you ping from the bare metal to the VMs?

Yes, I can ping from the bare metal nodes to the VMs and vice versa. I am also able to telnet from one to another on both ports 9200 and 9300. I've run nslookup on all of the nodes and can resolve the correct hostname and IP address for each.

network.host: "_site_ , _local_" was causing the bare metal nodes to bind to the wrong network interface. Setting this option directly to the correct IP address solved the issue.
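For anyone hitting the same thing, a sketch of the working setting on a data node, assuming 10.25.41.12 (the publish_address from the logs above) is that node's reachable interface; substitute each node's own IP:

# Bind and publish on the interface the other nodes can reach,
# instead of letting _site_/_local_ pick one:
network.host: 10.25.41.12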
