Failed to send join request to master in Elastic 6.4.0

Hi,
I am getting a "failed to send join request to master" error while running Elasticsearch 6.4.0. Please see the configuration and log snapshot below and help me.

Master Node Config

cluster.name: aabb
node.name: linsee1
node.master: true
node.data: false
path.data: /var/fpwork/workspace_smtrs/elasticsearch-6.4.0/data
path.logs: /var/fpwork/workspace_smtrs/elasticsearch-6.4.0/logs
#discovery.zen.ping.unicast.hosts: ["10.182.197.132","10.182.197.102","10.182.197.160"]
discovery.zen.ping.unicast.hosts: ["10.182.197.102:9500","10.182.197.132:9500","10.182.197.160:9500","10.159.252.99:9500"]
bootstrap.system_call_filter: false
action.auto_create_index: ".security*,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*"
network.host: 0.0.0.0
transport.tcp.compress: true
http.port : 9400
transport.tcp.port: 9500
discovery.zen.minimum_master_nodes: 1
network.publish_host: 0.0.0.0
network.bind_host: 0.0.0.0

Data Node 1

cluster.name: aabb
node.name: master
node.master: false
node.data: true
path.data: /var/fpwork/elastic_kibana/elasticsearch-6.4.0/data
path.logs: /var/fpwork/elastic_kibana/elasticsearch-6.4.0/logs
discovery.zen.ping.unicast.hosts: ["10.182.197.102:9500","10.182.197.132:9500","10.182.197.160:9500","10.159.252.99:9500"]
bootstrap.system_call_filter: false
action.auto_create_index: ".security*,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*"
network.host: 0.0.0.0
transport.tcp.compress: true
http.port : 9400
transport.tcp.port: 9500
discovery.zen.minimum_master_nodes: 1
network.publish_host: 0.0.0.0
network.bind_host: 0.0.0.0

Data Node 2

cluster.name: aabb
node.name: cloud1
node.master: false
node.data: true
path.data: /var/fpwork/workspace_smtrs/elasticsearch-6.4.0/data
path.logs: /var/fpwork/workspace_smtrs/elasticsearch-6.4.0/logs
#discovery.zen.ping.unicast.hosts: ["10.182.197.132","10.182.197.102","10.182.197.160"]
discovery.zen.ping.unicast.hosts: ["10.182.197.102:9500","10.182.197.132:9500","10.182.197.160:9500","10.159.252.99:9500"]
#discovery.zen.ping.unicast.hosts: ["10.182.197.102","10.182.197.132","10.182.197.160"]
bootstrap.system_call_filter: false
action.auto_create_index: ".security*,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*"
network.host: 0.0.0.0
transport.tcp.compress: true
http.port : 9400
transport.tcp.port: 9500
discovery.zen.minimum_master_nodes: 1
network.publish_host: 0.0.0.0
network.bind_host: 0.0.0.0

Data Node 3

cluster.name: aabb
node.name: cloud2
node.master: false
node.data: true
path.data: /var/fpwork/workspace_smtrs/elasticsearch-6.4.0/data
path.logs: /var/fpwork/workspace_smtrs/elasticsearch-6.4.0/logs
discovery.zen.ping.unicast.hosts: ["10.182.197.102:9500","10.182.197.132:9500","10.182.197.160:9500","10.159.252.99:9500"]
bootstrap.system_call_filter: false
action.auto_create_index: ".security*,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*"
network.host: 0.0.0.0
transport.tcp.compress: true
http.port : 9400
transport.tcp.port: 9500
discovery.zen.minimum_master_nodes: 1
network.publish_host: 0.0.0.0
network.bind_host: 0.0.0.0

Console Snapshot

[2018-09-10T15:38:57,773][INFO ][o.e.n.Node ] [cloud2] starting ...
[2018-09-10T15:38:57,924][INFO ][o.e.t.TransportService ] [cloud2] publish_address {192.168.0.37:9500}, bound_addresses {[::]:9500}
[2018-09-10T15:38:57,937][INFO ][o.e.b.BootstrapChecks ] [cloud2] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-09-10T15:39:27,958][WARN ][o.e.n.Node ] [cloud2] timed out while waiting for initial discovery state - timeout: 30s
[2018-09-10T15:39:27,981][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [cloud2] publish_address {192.168.0.37:9400}, bound_addresses {[::]:9400}
[2018-09-10T15:39:27,982][INFO ][o.e.n.Node ] [cloud2] started
[2018-09-10T15:39:31,413][INFO ][o.e.d.z.ZenDiscovery ] [cloud2] failed to send join request to master [{linsee1}{OFlb-GtlR5e-HP84Fh0oeg}{oTbJsRaKRnO5u2HYzFaFbw}{10.159.252.99}{10.159.252.99:9500}{ml.machine_memory=202717806592, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[linsee1][10.159.252.99:9500][internal:discovery/zen/join]]; nested: ConnectTransportException[[cloud2][192.168.0.37:9500] connect_timeout[30s]]; ]
[2018-09-10T15:40:04,538][INFO ][o.e.d.z.ZenDiscovery ] [cloud2] failed to send join request to master [{linsee1}{OFlb-GtlR5e-HP84Fh0oeg}{oTbJsRaKRnO5u2HYzFaFbw}{10.159.252.99}{10.159.252.99:9500}{ml.machine_memory=202717806592, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[linsee1][10.159.252.99:9500][internal:discovery/zen/join]]; nested: ConnectTransportException[[cloud2][192.168.0.37:9500] connect_timeout[30s]]; ]
[2018-09-10T15:40:37,667][INFO ][o.e.d.z.ZenDiscovery ] [cloud2] failed to send join request to master [{linsee1}{OFlb-GtlR5e-HP84Fh0oeg}{oTbJsRaKRnO5u2HYzFaFbw}{10.159.252.99}{10.159.252.99:9500}{ml.machine_memory=202717806592, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[linsee1][10.159.252.99:9500][internal:discovery/zen/join]]; nested: ConnectTransportException[[cloud2][192.168.0.37:9500] connect_timeout[30s]

@Christian_Dahlqvist @dadoonet Can you please help me? I have been stuck on this issue for a long time.

Do you have network connectivity between the nodes so they can all connect on port 9500?

When I telnet to port 9500 on the master node from the data nodes, I get connection refused.

telnet 10.159.252.99 9500
Trying 10.159.252.99...
telnet: connect to address 10.159.252.99: Connection refused

NOTE: we get this error only when ES is stopped.

When ES is running, the connection succeeds.

[smtrs@dhananjay-test-1 elasticsearch-6.4.0]$ telnet 10.159.252.99 9500
Trying 10.159.252.99...
Connected to 10.159.252.99.
Escape character is '^]'.

^CConnection closed by foreign host.
[smtrs@dhananjay-test-1 elasticsearch-6.4.0]$ telnet 10.159.252.99 9400
Trying 10.159.252.99...
Connected to 10.159.252.99.
Escape character is '^]'.

I am not sure I follow your description. The data nodes need to be able to connect to the master node, but they also need to connect to each other on port 9500. As you only have a single master-eligible node (which is not recommended, as I mentioned in the related post), no node will be able to connect to the master when it is down; you will see errors and the cluster will be in a red state.

As you can see, the nodes are able to connect to port 9500. What else should I check, apart from the port, to solve this issue? Should I downgrade to 6.3? Please suggest.

Based on the logs it doesn't look like the nodes are able to connect. I do not see how downgrading to 6.3 would solve anything.
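
To make the reading of the logs concrete, here is a minimal Python sketch that pulls two addresses out of the cloud2 console snapshot posted earlier in this thread: the transport address the node publishes, and the address named in the nested ConnectTransportException that came back from linsee1. The log lines are abridged (node IDs and attributes removed), and the regular expressions and the `summarize` helper are illustrative assumptions, not part of Elasticsearch.

```python
import re

# Two lines abridged from the cloud2 console snapshot earlier in this thread.
LOG = """\
[2018-09-10T15:38:57,924][INFO ][o.e.t.TransportService ] [cloud2] publish_address {192.168.0.37:9500}, bound_addresses {[::]:9500}
[2018-09-10T15:39:31,413][INFO ][o.e.d.z.ZenDiscovery ] [cloud2] failed to send join request to master [{linsee1}{10.159.252.99}{10.159.252.99:9500}], reason [RemoteTransportException[[linsee1][10.159.252.99:9500][internal:discovery/zen/join]]; nested: ConnectTransportException[[cloud2][192.168.0.37:9500] connect_timeout[30s]]; ]
"""

# Hypothetical patterns written for this log format only.
PUBLISH_RE = re.compile(r"publish_address \{([^}]+)\}")
CONNECT_FAIL_RE = re.compile(r"ConnectTransportException\[\[([^\]]+)\]\[([^\]]+)\]")


def summarize(log_text):
    """Return (publish_address, failed_node, failed_address) found in log_text."""
    publish = PUBLISH_RE.search(log_text)
    failure = CONNECT_FAIL_RE.search(log_text)
    return (
        publish.group(1) if publish else None,
        failure.group(1) if failure else None,
        failure.group(2) if failure else None,
    )


publish_address, failed_node, failed_address = summarize(LOG)
print("published by this node:", publish_address)
print("remote side could not reach:", failed_node, "at", failed_address)
print("same address:", publish_address == failed_address)
```

In this excerpt the two addresses match: the exception is reported by linsee1, and the address it timed out on is the one cloud2 publishes (192.168.0.37:9500), which is not among the 10.x addresses listed in discovery.zen.ping.unicast.hosts. One possible reading, offered as an assumption rather than a confirmed diagnosis, is that the master cannot route back to the address the data node advertises, in which case the value of network.publish_host on each node would be worth a look.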

Start up all nodes and then systematically log into each node and verify you can telnet to port 9500 on all the other nodes from that host.
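
The check described above can also be scripted instead of typed into telnet by hand. The following is a small sketch using only the Python standard library; the `can_connect` and `check_peers` helpers are hypothetical names, and the peer list is copied from the discovery.zen.ping.unicast.hosts setting in the 6.4.0 configs at the top of the thread, so adjust it to the addresses your nodes actually publish.

```python
import socket

# Addresses copied from discovery.zen.ping.unicast.hosts in the configs above.
PEERS = [
    ("10.182.197.102", 9500),
    ("10.182.197.132", 9500),
    ("10.182.197.160", 9500),
    ("10.159.252.99", 9500),
]


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def check_peers(peers, timeout=3.0):
    """Map each (host, port) pair to True/False reachability from this host."""
    return {peer: can_connect(peer[0], peer[1], timeout) for peer in peers}
```

Calling `check_peers(PEERS)` on every host in turn, the master included, gives the full picture in both directions. A successful connection only shows that the port accepts TCP connections from that host; it does not prove that the address a node publishes in its own log is the one the other nodes can reach.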

I am also getting a few warnings on the master node. Please see if they help to identify the issue.

Console log

[2018-09-10T17:22:39,431][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [linsee1] exception caught on transport layer [NettyTcpChannel{localAddress=/10.159.252.99:9500, remoteAddress=/10.182.197.132:48794}], closing connection
io.netty.handler.codec.DecoderException: java.io.StreamCorruptedException: invalid internal transport message format, got (ff,f4,ff,fd)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:459) ~[netty-codec-4.1.16.Final.jar:4.1.16.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:392) ~[netty-codec-4.1.16.Final.jar:4.1.16.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:359) ~[netty-codec-4.1.16.Final.jar:4.1.16.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342) ~[netty-codec-4.1.16.Final.jar:4.1.16.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
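
A note on the bytes in this warning, as a hedged sketch rather than a confirmed diagnosis: Elasticsearch transport messages begin with the two marker bytes 'E' and 'S', and the four bytes logged here, (ff,f4,ff,fd), look like Telnet command sequences as defined in RFC 854. That would be consistent with the warning being produced by the telnet tests themselves (for example pressing Ctrl-C inside a telnet session to port 9500) rather than by another Elasticsearch node. The `describe` helper below is a hypothetical name used only for illustration.

```python
# Telnet command codes from RFC 854; only a handful are listed here.
TELNET_COMMANDS = {
    0xF4: "IP (Interrupt Process)",
    0xFB: "WILL",
    0xFC: "WONT",
    0xFD: "DO",
    0xFE: "DONT",
}
IAC = 0xFF  # "Interpret As Command" marker byte


def describe(header):
    """Describe a short byte prefix as an 'ES' transport header or Telnet commands."""
    if header[:2] == b"ES":
        return ["Elasticsearch transport header 'ES'"]
    out = []
    i = 0
    while i < len(header):
        if header[i] == IAC and i + 1 < len(header):
            # IAC followed by a command byte forms one Telnet command.
            out.append("IAC " + TELNET_COMMANDS.get(header[i + 1], hex(header[i + 1])))
            i += 2
        else:
            out.append(hex(header[i]))
            i += 1
    return out


logged = bytes.fromhex("fff4fffd")  # the (ff,f4,ff,fd) from the warning above
print(describe(logged))
```

If this reading is right, the warning is a side effect of testing with telnet and says nothing about why the nodes fail to join.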

Hi @Christian_Dahlqvist, I have logged into the master and each data node and tried telnet, and it is working fine. Please let me know if you want me to try anything else.

I tried with 3 nodes. Please see the node details below:
Master - 10.182.197.102
Data Node 1 - 10.159.252.101
Data Node 2 - 10.159.252.99

From master
[smtrs@dhananjay-test-1 ~]$ telnet 10.159.252.99 9300
Trying 10.159.252.99...
Connected to 10.159.252.99.
Escape character is '^]'.
[smtrs@dhananjay-test-1 ~]$ telnet 10.159.252.99 9200
Trying 10.159.252.99...
Connected to 10.159.252.99.
Escape character is '^]'.
[smtrs@dhananjay-test-1 ~]$ telnet 10.159.252.101 9300
Trying 10.159.252.101...
Connected to 10.159.252.101.
Escape character is '^]'.
[smtrs@dhananjay-test-1 ~]$ telnet 10.159.252.101 9200
Trying 10.159.252.101...
Connected to 10.159.252.101.
Escape character is '^]'.

From Node 1
[smtrs@bhlinb42 ~]$ telnet 10.159.252.101 9300
Trying 10.159.252.101...
Connected to 10.159.252.101.
Escape character is '^]'.
[smtrs@bhlinb42 ~]$ telnet 10.159.252.101 9200
Trying 10.159.252.101...
Connected to 10.159.252.101.
Escape character is '^]'.

[smtrs@bhlinb42 ~]$ telnet 10.182.197.102 9200
Trying 10.182.197.102...
Connected to 10.182.197.102.
Escape character is '^]'.
[smtrs@bhlinb42 ~]$ telnet 10.182.197.102 9300
Trying 10.182.197.102...
Connected to 10.182.197.102.
Escape character is '^]'.

From Node 2
[smtrs@bhlinb44 ~]$ telnet 10.159.252.99 9200
Trying 10.159.252.99...
Connected to 10.159.252.99.
Escape character is '^]'.
[smtrs@bhlinb44 ~]$ telnet 10.159.252.99 9300
Trying 10.159.252.99...
Connected to 10.159.252.99.
Escape character is '^]'.

[smtrs@bhlinb44 ~]$ telnet 10.182.197.102 9200
Trying 10.182.197.102...
Connected to 10.182.197.102.
Escape character is '^]'.
[smtrs@bhlinb44 ~]$ telnet 10.182.197.102 9300
Trying 10.182.197.102...
Connected to 10.182.197.102.
Escape character is '^]'.

Error Snapshot -
[2018-09-10T19:20:08,348][WARN ][o.e.d.z.ZenDiscovery ] [node_linsee1] failed to connect to master [{node_master1}{TvG_gvLEQ1mQd57h5qHpsA}{NbXYBOnxS-WQTuCLDc3E1Q}{192.168.0.48}{192.168.0.48:9300}{ml.machine_memory=25284042752, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], retrying...
org.elasticsearch.transport.ConnectTransportException: [node_master1][192.168.0.48:9300] connect_timeout[30s]
at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163) ~[elasticsearch-6.3.2.jar:6.3.2]

Are these configured differently, as port 9500 no longer seems to be used?

Yes. I have changed the machines and installed Elasticsearch 6.3.2 to check whether it works with an older version. Please see the configuration below.

Master node-
cluster.name: new_cluster
node.name: node_master1
node.master: true
node.data: false
path.data: /var/fpwork/workspace_smtrs/elasticsearch-6.3.2/data
path.logs: /var/fpwork/workspace_smtrs/elasticsearch-6.3.2/logs
discovery.zen.ping.unicast.hosts: ["10.182.197.102","10.159.252.99","10.159.252.101"]
bootstrap.system_call_filter: false
action.auto_create_index: ".security*,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*"
network.host: 0.0.0.0
transport.tcp.compress: true
http.port : 9200
transport.tcp.port: 9300
discovery.zen.minimum_master_nodes: 1
network.publish_host: 0.0.0.0
network.bind_host: 0.0.0.0

Node1 -
cluster.name: new_cluster
node.name: node_linsee1
node.master: false
node.data: true
path.data: /var/fpwork/workspace_smtrs/elasticsearch-6.3.2/data
path.logs: /var/fpwork/workspace_smtrs/elasticsearch-6.3.2/logs
discovery.zen.ping.unicast.hosts: ["10.182.197.102","10.159.252.99","10.159.252.101"]
bootstrap.system_call_filter: false
action.auto_create_index: ".security*,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*"
network.host: 0.0.0.0
transport.tcp.compress: true
http.port : 9200
transport.tcp.port: 9300
discovery.zen.minimum_master_nodes: 1
network.publish_host: 0.0.0.0
network.bind_host: 0.0.0.0

Node2 -
cluster.name: new_cluster
node.name: node_linsee2
node.master: false
node.data: true
path.data: /var/fpwork/workspace_smtrs/elasticsearch-6.3.2/data
path.logs: /var/fpwork/workspace_smtrs/elasticsearch-6.3.2/logs
discovery.zen.ping.unicast.hosts: ["10.182.197.102","10.159.252.99","10.159.252.101"]
bootstrap.system_call_filter: false
action.auto_create_index: ".security*,.monitoring*,.watches,.triggered_watches,.watcher-history*,.ml*"
network.host: 0.0.0.0
transport.tcp.compress: true
http.port : 9200
transport.tcp.port: 9300
discovery.zen.minimum_master_nodes: 1
network.publish_host: 0.0.0.0
network.bind_host: 0.0.0.0

@Christian_Dahlqvist, please help me. I have been stuck on this for a long time.

I am getting the error below in the logs:

[2018-09-10T19:21:54,508][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [node_linsee2] exception caught on transport layer [NettyTcpChannel{localAddress=/10.159.252.101:9300, remoteAddress=/10.159.252.99:54842}], closing connection
io.netty.handler.codec.DecoderException: java.io.StreamCorruptedException: invalid internal transport message format, got (d,a,ff,f4)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:459) ~[netty-codec-4.1.16.Final.jar:4.1.16.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) ~[netty-codec-4.1.16.Final.jar:4.1.16.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]

I cannot spot anything obviously wrong at the moment. What operating system and exact JVM version are you using?

I am using Linux with Java 1.8

openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-b10)
OpenJDK 64-Bit Server VM (build 25.171-b10, mixed mode)

[smtrs@dhananjay-test-1 ~]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 6.8 (Santiago)

Do you have any plugins installed, e.g. related to security, that could affect connectivity? Anything else that is non-standard about the set-up?

I haven't installed anything other than Elasticsearch as of now.
