Unable to connect data node on metal to master node on VM

some_dood · August 1, 2018, 9:06pm

I have a 5-node cluster: 3 master nodes on RHEL VMs and 2 data nodes on RHEL bare metal. The master nodes connect to each other fine and show a cluster status of green. The data nodes are able to connect but then fail with the error:
timed out while waiting for initial discovery state - timeout: 30s

Here are the configs:


cluster.name: ${ES_CLUSTER_NAME}
#
node.name: ${HOSTNAME}
node.master: true
node.data: false
#
path.data: ${ES_DATA_PATH}
#
# Path to log files:
#
path.logs: ${ES_LOG_PATH}
#
# Path to backup directory
path.repo: ${ES_PATH_REPO}
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
bootstrap.memory_lock: ${ES_DISABLE_MEMORY_SWAPPING}
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: "_site_ , _local_"
#
# Set a custom port for HTTP:
#
http.port: ${ES_HTTP_PORT}
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: "hostname1, hostname2, hostname3, hostname4, hostname5"
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 2
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 5

The data nodes use the same config, except that node.master and node.data are flipped.

Logs from data node side:

Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,114][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] using profile[default], worker_count[64], port[9300-9400], bind_host[[_site_, _local_]], publish_host[[]], compress[false], connect_timeout[30s], connections_per_node[2/3/6/0/1], receive_predictor[64kb->64kb]
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,122][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] binding server bootstrap to: [::1, 127.0.0.1, 123456, 1234456, 1234456]
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,278][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] Bound profile [default] to address {[::1]:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,279][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] Bound profile [default] to address {127.0.0.1:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,280][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] Bound profile [default] to address {123456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,280][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] Bound profile [default] to address {1234456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,281][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] Bound profile [default] to address {1234456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,284][INFO ][o.e.t.TransportService   ] [hostname.example.com] publish_address {10.25.41.12:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}, {1234456:9300}, {1234456:9300}, {1234456:9300}
Aug 01 15:20:03 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:03,292][INFO ][o.e.b.BootstrapChecks    ] [hostname.example.com] bound or publishing to a non-loopback address, enforcing bootstrap checks
Aug 01 15:20:06 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:06,684][DEBUG][o.e.t.n.Netty4Transport  ] [hostname.example.com] connected to node [{MASTERNODE_HOSTNAME}{-WYkalJ7SqqTq80b42azrQ}{Y6f_c3V0RNCoeUtQiOaQIg}{xxxxx}{xxxxx:9300}]
Aug 01 15:20:33 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:33,315][WARN ][o.e.n.Node               ] [hostname.example.com] timed out while waiting for initial discovery state - timeout: 30s
Aug 01 15:20:33 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:33,335][INFO ][o.e.h.n.Netty4HttpServerTransport] [hostname.example.com] publish_address {xxxxxx:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}, {xxxxxxx:9200}, {xxxxxxx:9200}, {xxxxxxx:9200}
Aug 01 15:20:33 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:33,335][INFO ][o.e.n.Node               ] [hostname.example.com] started
Aug 01 15:20:36 hostname.example.com elasticsearch[31465]: [2018-08-01T15:20:36,710][INFO ][o.e.d.z.ZenDiscovery     ] [hostname.example.com] failed to send join request to master [{MASTERNODE_HOSTNAME}{-WYkalJ7SqqTq80b42azrQ}{Y6f_c3V0RNCoeUtQiOaQIg}{xxxxxxx}{xxxxxx:9300}], reason [RemoteTransportException[[MASTERNODE_HOSTNAME][xxxxxx:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[hostname.example.com][xxxxxx:9300] connect_timeout[30s]]; ]

The master node shows no relevant logs. I turned on the tracer and can see multiple requests being received from the data nodes and responses being sent back to them.

warkolm · August 1, 2018, 9:15pm

Can you ping from the bare metal to the VMs?

some_dood · August 2, 2018, 1:32pm

Yes, I can ping from bare metal to the VMs and vice versa. I am also able to telnet from one to another on both 9200 and 9300. I've done an nslookup on all of the nodes and am able to resolve the correct hostname and IP address.

some_dood · August 2, 2018, 2:00pm

network.host: "_site_ , _local_" was causing the bare metal nodes to bind to the wrong network interface. Setting network.host directly to the correct IP address solved the issue.
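
For anyone hitting the same symptom, the fix described here amounts to a one-line change in elasticsearch.yml on each bare-metal data node. This is only a minimal sketch: 192.0.2.10 is a placeholder from the documentation address range, not an address from the cluster in this thread, and it stands in for the IP of whichever interface the master nodes can route to.

```yaml
# elasticsearch.yml on a bare-metal data node (sketch, not the poster's actual file)
#
# Before: binds to every site-local and loopback address, and Elasticsearch
# then picks one of them to publish to the rest of the cluster.
# network.host: "_site_ , _local_"
#
# After: bind and publish on one explicit address.
# 192.0.2.10 is a placeholder; substitute the IP of the interface
# that the master nodes can actually reach.
network.host: 192.0.2.10
```

The data node log is consistent with this explanation: the node bound to several non-loopback addresses but published only one, and the join request failed with a ConnectTransportException raised on the master side while it tried to connect back to the data node. If the published address sits on an interface the masters cannot reach, the join can time out even though ping and telnet between the hosts succeed.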
