ES node finds the master nodes but can't join the cluster

Recently I ran into the following situation: one data node of the cluster has discovered the three master nodes, but it can't join the cluster. However, when I restart it, the data node joins successfully!

ES version:



bootstrap.memory_lock: true
cluster.max_shards_per_node: 10000 billions-uat7.5.1
cluster.routing.allocation.allow_rebalance: always
cluster.routing.allocation.cluster_concurrent_rebalance: 5
cluster.routing.allocation.node_concurrent_recoveries: 10
cluster.routing.allocation.node_initial_primaries_recoveries: 20 true
- hw-sh-t-opslog-01:9310
- hw-sh-t-opslog-02:9310
- hw-sh-t-opslog-03:9310
http.port: 9201
indices.recovery.max_bytes_per_sec: 500mb
node.attr.box_type: stale true
node.master: false
transport.tcp.port: 9301 true /etc/elasticsearch/datanode_stale/billions-certificates.p12 /etc/elasticsearch/datanode_stale/billions-certificates.p12 certificate hw-sh-t-opslog-09-datanode_stale /mnt/storage01/hw-sh-t-opslog-09-datanode_stale
path.logs: /mnt/storage01/elasticsearch/log/hw-sh-t-opslog-09-datanode_stale

action.auto_create_index: true true

error log:

exporter monitor


I want to know what caused this and how to solve it.

thanks :smile:

In fact the node is joining the cluster fine, but then leaving again shortly afterwards. This message tells us why:

node-left[{hw-sh-t-opslog-09-datanode_stale}{b3mQRikmS1a4mXmyHKXu8g}{M3gSzZE6QnGFuQc1v_lEYQ}{}{}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale} reason: disconnected], term: 680, version: 991417, delta: removed {{hw-sh-t-opslog-09-datanode_stale}{b3mQRikmS1a4mXmyHKXu8g}{M3gSzZE6QnGFuQc1v_lEYQ}{}{}{dil}{ml.machine_memory=67560857600, ml.max_open_jobs=20, xpack.installed=true, box_type=stale}}

The reason: disconnected bit indicates that something outside Elasticsearch was disrupting the TCP connections from the master to the problematic node.
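If there are many of these messages in the master's log, it can help to pull out the node name and the reason automatically. The following is only a rough sketch: the regular expression is my own guess based on the single message quoted above (with the spaces introduced by line wrapping removed), not an official log format, and `parse_node_left` is a hypothetical helper name.

```python
import re

# Assumption: the pattern is inferred from the one message quoted in this
# thread; other Elasticsearch versions may format the line differently.
NODE_LEFT = re.compile(
    r"node-left\[\{(?P<name>[^}]*)\}.*?reason: (?P<reason>[^\]]+)\]"
)


def parse_node_left(line):
    """Return (node_name, reason) for a node-left message, or None if no match."""
    match = NODE_LEFT.search(line)
    if match is None:
        return None
    return match.group("name"), match.group("reason").strip()


# The message from this thread, split across string literals for readability.
SAMPLE = (
    "node-left[{hw-sh-t-opslog-09-datanode_stale}{b3mQRikmS1a4mXmyHKXu8g}"
    "{M3gSzZE6QnGFuQc1v_lEYQ}{}{}{dil}{ml.machine_memory=67560857600, "
    "ml.max_open_jobs=20, xpack.installed=true, box_type=stale} "
    "reason: disconnected], term: 680, version: 991417"
)

if __name__ == "__main__":
    print(parse_node_left(SAMPLE))
    # -> ('hw-sh-t-opslog-09-datanode_stale', 'disconnected')
```

A reason of "disconnected" points at the network path, as described above; a different reason in the same field would suggest looking elsewhere.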

But when we restart it, the data node joins successfully.

Right, but it'll likely still get disconnected again later. Is that what you are seeing?

It's likely a firewall somewhere that isn't allowing long-lived connections.

:ok_hand: I'll go check the network-side monitoring.

thanks a lot

