ES node finds the masters, but can't join the cluster

Hi,
Recently I ran into the following situation: one data node of the cluster has discovered the three master nodes, but it can't join the cluster. However, when I restart it, the data node joins successfully!

ES version:

7.5.1

elasticsearch.yml

bootstrap.memory_lock: true
cluster.max_shards_per_node: 10000
cluster.name: billions-uat7.5.1
cluster.routing.allocation.allow_rebalance: always
cluster.routing.allocation.cluster_concurrent_rebalance: 5
cluster.routing.allocation.node_concurrent_recoveries: 10
cluster.routing.allocation.node_initial_primaries_recoveries: 20
cluster.routing.allocation.same_shard.host: true
discovery.seed_hosts:
- hw-sh-t-opslog-01:9310
- hw-sh-t-opslog-02:9310
- hw-sh-t-opslog-03:9310
http.port: 9201
indices.recovery.max_bytes_per_sec: 500mb
network.host: 0.0.0.0
node.attr.box_type: stale
node.data: true
node.master: false
transport.tcp.port: 9301
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.keystore.path: /etc/elasticsearch/datanode_stale/billions-certificates.p12
xpack.security.transport.ssl.truststore.path: /etc/elasticsearch/datanode_stale/billions-certificates.p12
xpack.security.transport.ssl.verification_mode: certificate

node.name: hw-sh-t-opslog-09-datanode_stale

path.data: /mnt/storage01/hw-sh-t-opslog-09-datanode_stale
path.logs: /mnt/storage01/elasticsearch/log/hw-sh-t-opslog-09-datanode_stale

action.auto_create_index: true
xpack.security.enabled: true

error log:

https://github.com/cjx-great/issues/tree/main/es-7.5.1/master_not_discovered

exporter monitor: [image]

I want to know what caused this and how to solve it!

thanks :smile:

In fact the node is joining the cluster fine, but then leaving again shortly afterwards. This message tells us why:

node-left[{hw-sh-t-opslog-09-datanode_stale}{b3mQRikmS1a4mXmyHKXu8g}{M3gSzZE6QnGFuQc1v_lEYQ}{10.221.46.135}{10.221.46.135:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale} reason: disconnected], term: 680, version: 991417, delta: removed {{hw-sh-t-opslog-09-datanode_stale}{b3mQRikmS1a4mXmyHKXu8g}{M3gSzZE6QnGFuQc1v_lEYQ}{10.221.46.135}{10.221.46.135:9301}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}

The "reason: disconnected" bit indicates that something outside Elasticsearch was disrupting the TCP connections from the master to the problematic node.
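If you want to check how often this happens and for which nodes, here is a minimal sketch that pulls the node name and the reason out of a node-left message like the one above. The regex and the name parse_node_left are my own illustration, not anything shipped with Elasticsearch, and the pattern assumes the 7.x message layout shown in this thread.

```python
import re

# Assumed layout (taken from the message quoted in this thread):
# node-left[{name}{node id}{ephemeral id}{host}{address}... reason: <reason>]
NODE_LEFT = re.compile(
    r"node-left\[\{(?P<name>[^}]*)\}"   # node name
    r"\{(?P<node_id>[^}]*)\}"           # node ID
    r"\{[^}]*\}"                        # ephemeral ID (not captured)
    r"\{(?P<host>[^}]*)\}"              # host
    r"\{(?P<address>[^}]*)\}"           # transport address
    r".*?reason: (?P<reason>[^\]]*)\]"  # reason, up to the closing bracket
)


def parse_node_left(message):
    """Return a dict with name, node_id, host, address and reason, or None."""
    match = NODE_LEFT.search(message)
    return match.groupdict() if match else None


# First part of the message quoted above, copied verbatim.
sample = (
    "node-left[{hw-sh-t-opslog-09-datanode_stale}{b3mQRikmS1a4mXmyHKXu8g}"
    "{M3gSzZE6QnGFuQc1v_lEYQ}{10.221.46.135}{10.221.46.135:9301}{dil}"
    "{ml.machine_memory=67560857600, ml.max_open_jobs=20, "
    "xpack.installed=true, box_type=stale} reason: disconnected]"
)

info = parse_node_left(sample)
print(info["name"], info["reason"])
# prints: hw-sh-t-opslog-09-datanode_stale disconnected
```

Running this over the master's log lines and counting entries per node and reason should show whether the disconnects are limited to one node or one network path.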


But when we restart it, the data node joins successfully.

Right, but it'll likely still get disconnected. Is that what you are seeing?

It's likely some firewall somewhere that isn't allowing long-lived connections.
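One rough way to sanity-check that theory on a Linux host is to compare the kernel's TCP keepalive idle time with the firewall's idle timeout. The sketch below is only an illustration: the 3600-second firewall timeout is a made-up value you would replace with the real one from your network team, the function names are hypothetical, and it only matters for connections that carry no other traffic while idle.

```python
from pathlib import Path

# Linux exposes the idle time before the first keepalive probe here.
LINUX_KEEPALIVE_TIME = Path("/proc/sys/net/ipv4/tcp_keepalive_time")


def read_keepalive_time(path=LINUX_KEEPALIVE_TIME, default=7200):
    """Return the keepalive idle time in seconds.

    Falls back to 7200, the Linux default, if the file cannot be read.
    """
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return default


def idle_connection_at_risk(keepalive_time_s, firewall_idle_timeout_s):
    """True if the firewall would drop an idle connection before (or just as)
    the kernel sends its first keepalive probe."""
    return keepalive_time_s >= firewall_idle_timeout_s


# Hypothetical firewall idle timeout of one hour; substitute the real value.
FIREWALL_IDLE_TIMEOUT_S = 3600

keepalive = read_keepalive_time()
print("keepalive idle time:", keepalive, "s")
print("at risk:", idle_connection_at_risk(keepalive, FIREWALL_IDLE_TIMEOUT_S))
```

If this reports a risk, the usual options are to lower the keepalive idle time on the hosts or to raise the idle timeout on the firewall.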

:ok_hand: I'll go check the network-side monitoring

thanks a lot
