New node cannot join existing cluster: IOException[Connection Refused]

Hi All,

This has had me puzzled for quite a while now.
I've been trying to add a 3rd node to our 2-node setup to avoid a split-brain situation, but for the life of me I cannot get the 3rd node into the cluster.
At the network level, a telnet test to port 9300 works.
Everything runs in VMware on the same VLAN, so iptables is the only thing to think about, and that's all done.

Here's the error:

[2018-07-29T02:13:52,319][INFO ][o.e.n.Node ] [sl1-elk-es-01.hosted.eu.flextrade.com] initializing ...
[2018-07-29T02:13:52,376][INFO ][o.e.e.NodeEnvironment ] [sl1-elk-es-01.hosted.eu.flextrade.com] using [1] data paths, mounts [[/var/lib/elasticsearch/data (/dev/sdc)]], net usable_space [75.4gb], net total_space [78.7gb], types [ext4]
[2018-07-29T02:13:52,377][INFO ][o.e.e.NodeEnvironment ] [sl1-elk-es-01.hosted.eu.flextrade.com] heap size [7.9gb], compressed ordinary object pointers [true]
[2018-07-29T02:13:52,378][INFO ][o.e.n.Node ] [sl1-elk-es-01.hosted.eu.flextrade.com] node name [sl1-elk-es-01.hosted.eu.flextrade.com], node ID [Q_vlX8nCTKinkcef3q3yBQ]
[2018-07-29T02:13:52,378][INFO ][o.e.n.Node ] [sl1-elk-es-01.hosted.eu.flextrade.com] version[6.2.3], pid[16158], build[c59ff00/2018-03-13T10:06:29.741383Z], OS[Linux/3.10.0-862.9.1.el7.x86_64/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_40/25.40-b25]
[2018-07-29T02:13:52,378][INFO ][o.e.n.Node ] [sl1-elk-es-01.hosted.eu.flextrade.com] JVM arguments [-Xms8g, -Xmx8g, -XX:+UseConcMarkSweepGC, -XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.io.tmpdir=/tmp/elasticsearch.0tM3MBCN, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:+PrintGCDetails, -XX:+PrintGCDateStamps, -XX:+PrintTenuringDistribution, -XX:+PrintGCApplicationStoppedTime, -Xloggc:/var/log/elasticsearch/gc.log, -XX:+UseGCLogFileRotation, -XX:NumberOfGCLogFiles=32, -XX:GCLogFileSize=64m, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch]
[2018-07-29T02:13:52,969][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [aggs-matrix-stats]
[2018-07-29T02:13:52,969][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [analysis-common]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [ingest-common]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [lang-expression]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [lang-mustache]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [lang-painless]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [mapper-extras]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [parent-join]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [percolator]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [rank-eval]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [reindex]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [repository-url]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [transport-netty4]
[2018-07-29T02:13:52,970][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] loaded module [tribe]
[2018-07-29T02:13:52,971][INFO ][o.e.p.PluginsService ] [sl1-elk-es-01.hosted.eu.flextrade.com] no plugins loaded
[2018-07-29T02:13:55,159][INFO ][o.e.d.DiscoveryModule ] [sl1-elk-es-01.hosted.eu.flextrade.com] using discovery type [zen]
[2018-07-29T02:13:55,579][INFO ][o.e.n.Node ] [sl1-elk-es-01.hosted.eu.flextrade.com] initialized
[2018-07-29T02:13:55,579][INFO ][o.e.n.Node ] [sl1-elk-es-01.hosted.eu.flextrade.com] starting ...
[2018-07-29T02:13:55,695][INFO ][o.e.t.TransportService ] [sl1-elk-es-01.hosted.eu.flextrade.com] publish_address {10.3.25.201:9300}, bound_addresses {10.3.25.201:9300}
[2018-07-29T02:13:55,704][INFO ][o.e.b.BootstrapChecks ] [sl1-elk-es-01.hosted.eu.flextrade.com] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-07-29T02:13:58,833][INFO ][o.e.d.z.ZenDiscovery ] [sl1-elk-es-01.hosted.eu.flextrade.com] failed to send join request to master [{sl1-elk-es-02.hosted.eu.flextrade.com}{NDlRBQjmQzW9Me-g_gVPeg}{iFUZm66_RDSOT8Yi9CJopQ}{10.3.25.202}{10.3.25.202:9300}], reason [RemoteTransportException[[sl1-elk-es-02.hosted.eu.flextrade.com][10.3.25.202:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[sl1-elk-es-01.hosted.eu.flextrade.com][10.3.25.201:9300] connect_exception]; nested: IOException[Connection refused: 10.3.25.201/10.3.25.201:9300]; nested: IOException[Connection refused]; ]
[2018-07-29T02:14:01,857][INFO ][o.e.d.z.ZenDiscovery ] [sl1-elk-es-01.hosted.eu.flextrade.com] failed to send join request to master [{sl1-elk-es-02.hosted.eu.flextrade.com}{NDlRBQjmQzW9Me-g_gVPeg}{iFUZm66_RDSOT8Yi9CJopQ}{10.3.25.202}{10.3.25.202:9300}], reason [RemoteTransportException[[sl1-elk-es-02.hosted.eu.flextrade.com][10.3.25.202:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[sl1-elk-es-01.hosted.eu.flextrade.com][10.3.25.201:9300] connect_exception]; nested: IOException[Connection refused: 10.3.25.201/10.3.25.201:9300]; nested: IOException[Connection refused]; ]
[2018-07-29T02:14:04,887][INFO ][o.e.d.z.ZenDiscovery ] [sl1-elk-es-01.hosted.eu.flextrade.com] failed to send join request to master [{sl1-elk-es-02.hosted.eu.flextrade.com}{NDlRBQjmQzW9Me-g_gVPeg}{iFUZm66_RDSOT8Yi9CJopQ}{10.3.25.202}{10.3.25.202:9300}], reason [RemoteTransportException[[sl1-elk-es-02.hosted.eu.flextrade.com][10.3.25.202:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[sl1-elk-es-01.hosted.eu.flextrade.com][10.3.25.201:9300] connect_exception]; nested: IOException[Connection refused: 10.3.25.201/10.3.25.201:9300]; nested: IOException[Connection refused]; ]
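The nested exception chain in those lines is easier to read when it is split into its links. Below is a minimal Python sketch (the helper names `split_chain` and `refused_endpoint` are made up for illustration, and the parsing only assumes the exact text format shown in the log above). It shows that the outermost exception is reported by sl1-elk-es-02, while the refused connection targets 10.3.25.201:9300, i.e. the joining node's own transport address.

```python
import re

# The "reason" part of one of the join-failure lines above, copied verbatim.
REASON = (
    "RemoteTransportException[[sl1-elk-es-02.hosted.eu.flextrade.com]"
    "[10.3.25.202:9300][internal:discovery/zen/join]]; "
    "nested: ConnectTransportException[[sl1-elk-es-01.hosted.eu.flextrade.com]"
    "[10.3.25.201:9300] connect_exception]; "
    "nested: IOException[Connection refused: 10.3.25.201/10.3.25.201:9300]; "
    "nested: IOException[Connection refused]; "
)


def split_chain(reason):
    """Return the exception links in order, outermost first."""
    links = []
    for part in reason.split("; "):
        part = part.strip()
        if not part:
            continue  # trailing "; " leaves an empty piece
        if part.startswith("nested: "):
            part = part[len("nested: "):]
        links.append(part)
    return links


def refused_endpoint(reason):
    """Return (ip, port) of the endpoint that refused the connection, or None."""
    match = re.search(r"Connection refused: [^/\]]+/([0-9.]+):(\d+)", reason)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```

Calling `split_chain(REASON)` gives four links, and `refused_endpoint(REASON)` returns the address of sl1-elk-es-01 rather than that of the master.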

Any help would be appreciated. I've seen similar topics on the interwebs but none provide more insight into the problem...

Kind Regards

Jan

Here are the telnet tests, run from sl1-elk-es-01:

[root@sl1-elk-es-01 ~]# telnet sl1-elk-es-02 9300
Trying 10.3.25.202...
Connected to sl1-elk-es-02.
Escape character is '^]'.
^]quit
telnet quit
Connection closed.

[root@sl1-elk-es-01 ~]# telnet sl1-elk-es-03 9300
Trying 10.3.25.203...
Connected to sl1-elk-es-03.
Escape character is '^]'.
^]quit
telnet quit
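Both telnet sessions above start from sl1-elk-es-01, so they only cover the outbound direction. A small Python sketch of the same check that could be run on every node, including from the master towards the new node, might look like this (the function names are hypothetical, and the 3-second timeout is an arbitrary choice):

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts.
        return False


def report(hosts, port=9300):
    """Return one 'host:port -> result' line per host."""
    lines = []
    for host in hosts:
        state = "open" if can_connect(host, port) else "refused/unreachable"
        lines.append("%s:%d -> %s" % (host, port, state))
    return lines
```

For example, running `print("\n".join(report(["10.3.25.201", "10.3.25.202", "10.3.25.203"])))` on sl1-elk-es-02 would test the direction that the telnet sessions do not.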

Here's the config of the node that does not want to join the fold:

grep "^[^#]" /etc/elasticsearch/elasticsearch.yml
cluster.name: ld4
node.name: ${HOSTNAME}
path.data: /var/lib/elasticsearch/data
path.logs: /var/log/elasticsearch
network.host: 10.3.25.201
discovery.zen.ping.unicast.hosts: ["10.3.25.201", "10.3.25.202", "10.3.25.203"]
discovery.zen.minimum_master_nodes: 1
action.destructive_requires_name: true
thread_pool.search.queue_size : 4000

Here's the config of the master node it's trying to connect to:

grep "^[^#]" /etc/elasticsearch/elasticsearch.yml
cluster.name: ld4
node.name: ${HOSTNAME}
node.master: true
node.data: true
path.data: /var/lib/elasticsearch/data
path.logs: /var/log/elasticsearch
network.host: 10.3.25.202
discovery.zen.ping.unicast.hosts: ["10.3.25.201","10.3.25.202","10.3.25.203"]
discovery.zen.minimum_master_nodes: 1
action.destructive_requires_name: true
thread_pool.search.queue_size : 4000

Cheers

Jan

Did some more digging:

- Rebuilt the VM for the node that can't join the cluster: same hostname, same IP, same settings (using config management).
- Set up tcpdump on the master node sl1-elk-es-02 to check whether traffic actually reaches it. It does.
So does this mean the master node does not let a new node join the cluster???
- Changed discovery.zen.minimum_master_nodes from 1 to 2 to see whether the old value was preventing the new node from joining. Still no joy....
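For reference, the Elasticsearch 6.x documentation gives the quorum for discovery.zen.minimum_master_nodes as (master_eligible_nodes / 2) + 1. A tiny Python sketch of that arithmetic follows (the function name is made up, and it assumes all three nodes are master-eligible, which is not stated explicitly for every node above):

```python
def minimum_master_nodes(master_eligible):
    """Quorum of master-eligible nodes: (N / 2) + 1 with integer division."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1
```

With three master-eligible nodes this gives 2, which is the value needed to guard against split brain once the third node is in.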

Next steps:
I'll make a new VM with a different hostname and IP to see if the same behaviour persists.

Some more stats on the VMs:

* All are running CentOS Linux release 7.5.1804 (Core)
* All run Java 1.8.0_40
* Version of Elasticsearch: 6.2.3
* 4 cores, 16 GB of RAM, 8 GB assigned to Elasticsearch via jvm.options

The annoying thing is that I'm tailing all relevant logs while the node tries to join, but nothing about this event appears in the logs of the master node, so we're kind of flying blind there...

Cheers

Jan

There must have been some stale entries for sl1-elk-es-01 on sl1-elk-es-02 that it didn't like, because sl1-elk-es-04 could join the cluster on the first try...
Exact same config, only a different IP address and hostname...

Ah well, guess this can be closed then...
