Failed to join a cluster because of time-out

I'm trying to build an Elasticsearch cluster, but it fails with an error.

Log from the master node (all IPs and comments have been omitted for privacy):

[2020-06-23T16:33:47,361][WARN ][o.e.c.c.Coordinator      ] [kn-log-01] failed to validate incoming join request from node [{kn-log-02}{tuCA1_YARK-HkHyzbpG4Nw}{0yZHEJGAQpKgWw336U2vDQ}{127.0.0.2}{127.0.0.2:9300}{dilrt}{ml.machine_memory=134888939520, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [kn-log-02][127.0.0.2:9300][internal:cluster/coordination/join/validate] request_id [88] timed out after [59835ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1041) [elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.7.0.jar:7.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]

Log from the data node that is trying to join:

org.elasticsearch.transport.RemoteTransportException: [kn-log-01][127.0.0.1:9300][internal:cluster/coordination/join]
Caused by: java.lang.IllegalStateException: failure when sending a validation request to node
        at org.elasticsearch.cluster.coordination.Coordinator$2.onFailure(Coordinator.java:514) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1139) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.transport.TransportService$8.run(TransportService.java:1001) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.7.0.jar:7.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [kn-log-02][127.0.0.2:9300][internal:cluster/coordination/join/validate] disconnected
[2020-06-23T16:41:47,433][WARN ][o.e.c.c.ClusterFormationFailureHelper] [kn-log-02] master not discovered yet: have discovered [{kn-log-02}{tuCA1_YARK-HkHyzbpG4Nw}{0yZHEJGAQpKgWw336U2vDQ}{127.0.0.2}{127.0.0.2:9300}{dilrt}{ml.machine_memory=134888939520, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]; discovery will continue using [127.0.0.1:9300, 127.0.0.3:9300, 127.0.0.4:9300] from hosts providers and [] from last-known cluster state; node term 1, last-accepted version 0 in term 0
[2020-06-23T16:41:57,434][WARN ][o.e.c.c.ClusterFormationFailureHelper] [kn-log-02] master not discovered yet: have discovered [{kn-log-02}{tuCA1_YARK-HkHyzbpG4Nw}{0yZHEJGAQpKgWw336U2vDQ}{127.0.0.2}{127.0.0.2:9300}{dilrt}{ml.machine_memory=134888939520, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}]; discovery will continue using [127.0.0.1:9300, 127.0.0.3:9300, 127.0.0.4:9300] from hosts providers and [] from last-known cluster state; node term 1, last-accepted version 0 in term 0

The node tries to join every minute, but each attempt times out. It worked yesterday and does not work now, even though (as far as I know) no Elasticsearch settings were changed.

elasticsearch.yml for master node

cluster.name: mycluster
node.name: kn-log-01
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
discovery.seed_hosts: ["127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4"]
cluster.initial_master_nodes: ["kn-log-01"]
node.master: true
node.data: true

elasticsearch.yml for data node

cluster.name: mycluster
node.name: kn-log-02
path.data: /data/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
discovery.seed_hosts: ["127.0.0.1", "127.0.0.2", "127.0.0.3", "127.0.0.4"]
cluster.initial_master_nodes: ["kn-log-01"]
node.master: false
node.data: true

$ curl -XGET 127.0.0.1:9200
{
  "name" : "kn-log-01",
  "cluster_name" : "mycluster",
  "cluster_uuid" : "jN-0FJwDRZqlAtQ6LpXwug",
  "version" : {
    "number" : "7.7.0",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "81a1e9eda8e6183f5237786246f6dced26a10eaf",
    "build_date" : "2020-05-12T02:01:37.602180Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
$ curl -XGET 127.0.0.2:9200
{
  "name" : "kn-log-02",
  "cluster_name" : "mycluster",
  "cluster_uuid" : "_na_",
  "version" : {
    "number" : "7.7.0",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "81a1e9eda8e6183f5237786246f6dced26a10eaf",
    "build_date" : "2020-05-12T02:01:37.602180Z",
    "build_snapshot" : false,
    "lucene_version" : "8.5.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
$ curl -XGET 127.0.0.1:9200/_cat/nodes?v
ip             heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
127.0.0.1           15           2   0    0.01    0.03     0.05 dilmrt    *      kn-log-01

What I did already:

  • Checked the firewalld settings for ports 9200 and 9300 again.
  • Rebooted all machines.
  • Wiped the Elasticsearch data folders and restarted the services.

Can you telnet to port 9300 on the other hosts (telnet x.x.x.x 9300) from each host?

Telnet to port 9300 works fine in both directions, to the master node and to the data node, while the Elasticsearch service is running.
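For anyone repeating this check without an interactive telnet session, here is a minimal sketch. The function names `check_port` and `check_all` are hypothetical, and the addresses are the anonymized ones from the post. It relies on bash's `/dev/tcp` pseudo-device and the coreutils `timeout` command, so both are assumed to be installed.

```shell
# Hypothetical helper: succeeds if a TCP connection to host ($1) and port ($2)
# can be opened within 3 seconds. Runs the connect inside bash because
# /dev/tcp is a bash feature, not a real device file.
check_port() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Hypothetical helper: check the transport port (9300) on every seed host.
# The addresses are the anonymized ones from the post; replace them with real ones.
check_all() (
    for host in 127.0.0.1 127.0.0.2 127.0.0.3 127.0.0.4; do
        if check_port "$host" 9300; then
            echo "$host:9300 open"
        else
            echo "$host:9300 unreachable"
        fi
    done
)
```

Run `check_all` on each node in turn. Note that a successful connect only shows that the small packets of the TCP handshake get through; it says nothing about larger payloads.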

When I changed kn-log-02 to a master-eligible node, I got the following log.

[2020-06-25T14:10:13,181][WARN ][o.e.t.TransportService   ] [kn-log-02] Received response for a request that has timed out, sent [60036ms] ago, timed out [0ms] ago, action [internal:cluster/coordination/join], node [{kn-log-01}{2vdl6zdaTDWFjeGc9D6boQ}{8TNI9H74SFK1lKxH0VoTxA}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=134888939520, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}], id [129]
[2020-06-25T14:10:22,139][INFO ][o.e.c.c.JoinHelper       ] [kn-log-02] last failed join attempt was 8.9s ago, failed to join {kn-log-01}{2vdl6zdaTDWFjeGc9D6boQ}{8TNI9H74SFK1lKxH0VoTxA}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=134888939520, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} with JoinRequest{sourceNode={kn-log-02}{o7ei8kSHTLGaLlTXxGzenw}{KkL6tshFSM6lmS5by9CIbw}{127.0.0.2}{127.0.0.2:9300}{dilmrt}{ml.machine_memory=134888873984, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, minimumTerm=1, optionalJoin=Optional[Join{term=1, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={kn-log-02}{o7ei8kSHTLGaLlTXxGzenw}{KkL6tshFSM6lmS5by9CIbw}{127.0.0.2}{127.0.0.2:9300}{dilmrt}{ml.machine_memory=134888873984, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, targetNode={kn-log-01}{2vdl6zdaTDWFjeGc9D6boQ}{8TNI9H74SFK1lKxH0VoTxA}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=134888939520, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [kn-log-01][127.0.0.1:9300][internal:cluster/coordination/join] request_id [129] timed out after [60036ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1041) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.7.0.jar:7.7.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
[2020-06-25T14:10:22,141][WARN ][o.e.c.c.ClusterFormationFailureHelper] [kn-log-02] master not discovered yet, this node has not previously joined a bootstrapped (v7+) cluster, and this node must discover master-eligible nodes [kn-log-01] to bootstrap a cluster: have discovered [{kn-log-02}{o7ei8kSHTLGaLlTXxGzenw}{KkL6tshFSM6lmS5by9CIbw}{127.0.0.2}{127.0.0.2:9300}{dilmrt}{ml.machine_memory=134888873984, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}, {kn-log-01}{2vdl6zdaTDWFjeGc9D6boQ}{8TNI9H74SFK1lKxH0VoTxA}{127.0.0.1}{127.0.0.1:9300}{dilmrt}{ml.machine_memory=134888939520, ml.max_open_jobs=20, xpack.installed=true, transform.node=true}]; discovery will continue using [127.0.0.1:9300, 127.0.0.3:9300, 127.0.0.4:9300] from hosts providers and [{kn-log-02}{o7ei8kSHTLGaLlTXxGzenw}{KkL6tshFSM6lmS5by9CIbw}{127.0.0.2}{127.0.0.2:9300}{dilmrt}{ml.machine_memory=134888873984, xpack.installed=true, transform.node=true, ml.max_open_jobs=20}] from last-known cluster state; node term 1, last-accepted version 0 in term 0

This seems very strange to me. The log says this node must discover kn-log-01 to bootstrap a cluster, yet it also says it has discovered [kn-log-02, kn-log-01].

I re-installed the ELK stack. Now servers #1 and #2 can form a cluster, but #3 and #4 can't join.

I got a new log message when trying to connect the 3rd and 4th servers.

[2020-06-26T12:48:28,802][WARN ][o.e.t.TcpTransport       ] [kn-log-01] invalid internal transport message format, got (ff,f4,ff,fd), [Netty4TcpChannel{localAddress=/127.0.0.1:9300, remoteAddress=/127.0.0.4:51254}], closing connection

This message didn't appear before the reinstall. I'm not sure whether it is related to the issue, but I'm trying to work from here.
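One possible reading of the bytes (ff,f4,ff,fd) in that message: in the Telnet protocol (RFC 854), 0xff is IAC, 0xf4 is IP (Interrupt Process) and 0xfd is DO. A Telnet client commonly sends a sequence starting this way when the session is interrupted, for example with Ctrl-C. If so, the warning probably comes from a telnet test against port 9300 rather than from an Elasticsearch node, since the transport layer expects its own message header. The sketch below only decodes the bytes; the function names are hypothetical.

```shell
# Map one byte (two lowercase hex digits) to its Telnet command name (RFC 854).
telnet_cmd_name() {
    case "$1" in
        ff) echo "IAC" ;;   # Interpret As Command
        f4) echo "IP" ;;    # Interrupt Process
        fb) echo "WILL" ;;
        fc) echo "WONT" ;;
        fd) echo "DO" ;;
        fe) echo "DONT" ;;
        *)  echo "0x$1" ;;  # not a command byte covered here; print it as hex
    esac
}

# Decode a list of bytes into a space-separated list of names.
decode_bytes() (
    out=""
    for b in "$@"; do
        out="$out$(telnet_cmd_name "$b") "
    done
    echo "${out% }"         # strip the trailing space
)

decode_bytes ff f4 ff fd    # prints "IAC IP IAC DO"
```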

Sorry, I don't know what's going on. There is an existing issue about this log message.

Finally solved. The cause was a physical network problem.

The MTU of the Ethernet card was configured to a value that the hardware does not support. After I fixed it, everything works.
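A bad MTU would be consistent with the symptoms above: the small packets of a telnet connect get through, while larger transport messages (the join validation request carries the cluster state) can be dropped and then time out. The following is a minimal sketch for checking this, assuming Linux with iputils `ping`. The function names and the interface name `eth0` in the usage line are hypothetical.

```shell
# ICMP payload size that exactly fills one IPv4 packet of the given MTU:
# MTU minus the 20-byte IPv4 header minus the 8-byte ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Print the MTU currently configured on an interface ($1), read from sysfs.
show_mtu() {
    cat "/sys/class/net/$1/mtu"
}

# Hypothetical helper: send one ping to host ($1) sized to fill MTU ($2),
# with the don't-fragment bit set (-M do). A failure suggests the path
# cannot carry packets of that size.
probe_mtu() {
    ping -M do -c 1 -W 2 -s "$(icmp_payload_for_mtu "$2")" "$1" >/dev/null 2>&1
}
```

For example, `show_mtu eth0` followed by `probe_mtu 127.0.0.2 1500 && echo "1500 ok"` from each node (use the value `show_mtu` reports, such as 9000 if jumbo frames are configured). If small probes succeed and MTU-sized probes fail, the configured MTU is likely larger than what the network path supports.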
