Elasticsearch Cluster issues

Hello,

I am trying to build an Elasticsearch cluster and keep running into an issue. I have the following:

5 physical nodes, CentOS 7 installed, trying to run dedicated master/data nodes.
Elasticsearch 6.6.2

I have gotten 1 master node and 1 other node communicating correctly, but when I add a data node to the mix, the cluster no longer works.

Working Master config:
cluster.name: cluster
node.name: elk01
node.master: true
node.data: false
network.host: 10.0.0.200
discovery.zen.ping.unicast.hosts: ["10.0.0.200"]

Data node config:
cluster.name: cluster
node.name: elk02
node.master: false
node.data: true
network.host: 10.0.0.201
discovery.zen.ping.unicast.hosts: ["10.0.0.200"]
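(For reference, a quick way to see which nodes have actually joined is the cat nodes API; a sketch, assuming the default HTTP port 9200 and the master's address from the config above:)

curl 'http://10.0.0.200:9200/_cat/nodes?v'

Each node that has joined should show up as one row, with its roles listed in the node.role column.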

Right now I am only working with 1 master and 1 data node to see if I can get it working before expanding further. The logs I get from the master are:

[2019-04-24T15:20:27,810][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elk01] failed to execute on node [xA9qaH-JQZ6G5cdZRi_8vw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk02][10.0.0.201:9300][cluster:monitor/nodes/stats[n]] request_id [828] timed out after [15010ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1011) [elasticsearch-6.6.2.jar:6.6.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.2.jar:6.6.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]

The IP address in the error message doesn't match the one in the config you quoted, which is strange.

Also, is this the only message you get, or are there other kinds of messages too? This message tells us that elk02 didn't respond to a stats request within 15 seconds, which isn't great, but it doesn't really tell us why.

Sorry, I switched the IPs from what we actually have set and didn't change them all correctly in the log. I have edited the main post with the same info.

As for other logs, not really. Here is more of the same log.

[2019-04-24T15:07:58,484][INFO ][o.e.c.s.MasterService ] [elk01] zen-disco-node-join[{elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason: added {{elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
[2019-04-24T15:07:59,282][INFO ][o.e.c.s.ClusterApplierService] [elk01] added {{elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{10.0.0.201}{172.16.4.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {elk01}{D06OsOstQjWunae6-UublQ}{uT-BNQ30SeuZN-X3AHPzkA}{10.0.0.200}{10.0.0.200:9300}{ml.machine_memory=33567576064, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [15] source [zen-disco-node-join[{elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]]])
[2019-04-24T15:10:42,791][WARN ][o.e.c.InternalClusterInfoService] [elk01] Failed to update node information for ClusterInfoUpdateJob within 15s timeout
[2019-04-24T15:10:42,792][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [elk01] failed to execute on node [xA9qaH-JQZ6G5cdZRi_8vw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [elk02][10.0.0.201:9300][cluster:monitor/nodes/stats[n]] request_id [195] timed out after [15012ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1011) [elasticsearch-6.6.2.jar:6.6.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.2.jar:6.6.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]

Then eventually the following:

[2019-04-24T15:53:56,494][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [elk01] failed to execute on node [xA9qaH-JQZ6G5cdZRi_8vw]
org.elasticsearch.transport.NodeDisconnectedException: [elk02][10.0.0.201:9300][cluster:monitor/nodes/info[n]] disconnected
[2019-04-24T15:53:56,495][INFO ][o.e.c.s.MasterService ] [elk01] zen-disco-node-failed({elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(transport disconnected)[{elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} transport disconnected], reason: removed {{elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{10.0.0.201}{172.16.4.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
[2019-04-24T15:53:56,496][INFO ][o.e.c.s.ClusterApplierService] [elk02] removed {{elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {elk01}{D06OsOstQjWunae6-UublQ}{uT-BNQ30SeuZN-X3AHPzkA}{172.16.4.200}{172.16.4.200:9300}{ml.machine_memory=33567576064, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [16] source [zen-disco-node-failed({elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{172.16.4.201}{172.16.4.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}), reason(transport disconnected)[{elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{172.16.4.201}{172.16.4.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} transport disconnected]]])

I suspect network issues. Is there a firewall or intrusion-detection system in between these nodes? It looks like connections are being established but then traffic starts getting dropped. I've seen this behaviour before when an over-enthusiastic firewall decides that some of the traffic between nodes looks suspicious and blocks it.

I have turned off firewalld on all nodes to test that, with the same result. They all sit on the same switch and do not pass through a firewall to reach each other.

I have used telnet to connect to ports 9200 and 9300 from each node to each node, and all attempts were successful.
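(For completeness, the same check can be scripted; a rough sketch using bash's /dev/tcp, with the two hosts from the configs above:)

for host in 10.0.0.200 10.0.0.201; do
  for port in 9200 9300; do
    timeout 3 bash -c "echo > /dev/tcp/$host/$port" \
      && echo "$host:$port reachable" || echo "$host:$port NOT reachable"
  done
done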

Yes, it looks like it's possible to form a connection, but then the traffic is blocked after the connection is established.

Can you start your nodes with logger.org.elasticsearch.transport: TRACE to trace the actual messages going back and forth? You'll get quite a lot of log output, but what we're looking for is evidence that at some point requests from one node stop arriving at the other node.
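(For reference, that logger can go in elasticsearch.yml on each node, which needs a restart, or it can be applied on the fly through the cluster settings API; a sketch, assuming the master's HTTP address from the configs above:)

curl -X PUT 'http://10.0.0.200:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.transport": "TRACE"
  }
}'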

I will make that adjustment now and let you know.

While trying to figure out any other network-related issues, I realized the initial setup actually spans different switches: elk01 is a VM and elk02 is a physical box. I switched out elk01 for elk05 as the master, and this is what I get.

ELK01:

[2019-04-25T09:13:57,632][INFO ][o.e.d.z.ZenDiscovery ] [elk01] failed to send join request to master [{elk05}{86cSjEHqSP-xznIsgzf15w}{RRbTggVMTbOWx_DwzR03wQ}{10.0.0.205}{10.0.0.205:9300}{ml.machine_memory=101126189056, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [ElasticsearchTimeoutException[java.util.concurrent.TimeoutException: Timeout waiting for task.]; nested: TimeoutException[Timeout waiting for task.]; ]

ELK05 (New Master):

[2019-04-25T09:01:21,764][WARN ][o.e.d.z.ZenDiscovery     ] [elk05] failed to validate incoming join request from node [{elk01}{jz7Kr0TlRyaR9rNNKErj9Q}{sr-WQlObSVqh5TvRhOyxLg}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:62) ~[elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:32) ~[elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.discovery.zen.MembershipAction.sendValidateJoinRequestBlocking(MembershipAction.java:106) ~[elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.discovery.zen.ZenDiscovery.handleJoinRequest(ZenDiscovery.java:888) [elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.discovery.zen.ZenDiscovery$MembershipListener.onJoin(ZenDiscovery.java:1135) [elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.discovery.zen.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:142) [elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.discovery.zen.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:138) [elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:30) [elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:250) [x-pack-security-6.6.2.jar:6.6.2]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:308) [x-pack-security-6.6.2.jar:6.6.2]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) [elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1289) [elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.6.2.jar:6.6.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:235) ~[elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:69) ~[elasticsearch-6.6.2.jar:6.6.2]
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:60) ~[elasticsearch-6.6.2.jar:6.6.2]

By the way could you format your log output using the </> button so it looks like this:

[2019-04-24T15:07:58,484][INFO ][o.e.c.s.MasterService ] [elk01] zen-disco-node-join[{elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason: added {{elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}
[2019-04-24T15:07:59,282][INFO ][o.e.c.s.ClusterApplierService] [elk01] added {{elk02}{xA9qaH-JQZ6G5cdZRi_8vw}{qbtkQcGBTtG5lwF9_lDZLw}{10.0.0.201}{172.16.4.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true},}, reason: apply cluster state (from master [master {elk01}{D06OsOstQjWunae6-UublQ}{uT-BNQ30SeuZN-X3AHPzkA}{10.0.0.200}{10.0.0.200:9300}
...

It's really hard to read it otherwise.

Yeah, sorry, new to all this.

No probs, much better now.

The new messages also look like some kind of connectivity issue and logger.org.elasticsearch.transport: TRACE would help us see whether traffic is getting through or not.

Man, that increases the log output greatly. Here is some of it:

Master ELK01:

[2019-04-25T09:22:39,154][TRACE][o.e.t.n.ESLoggingHandler ] [elk01] [id: 0x5349cd5f, L:/10.0.0.200:51918 - R:10.0.0.201/10.0.0.201:9300] ACTIVE
[2019-04-25T09:22:39,157][TRACE][o.e.t.TransportLogger    ] [elk01] an exception occurred formatting a WRITE trace message
java.io.EOFException: tried to read: 105 bytes but only 27 remaining
	at org.elasticsearch.common.bytes.BytesReferenceStreamInput.ensureCanReadBytes(BytesReferenceStreamInput.java:121) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper.ensureCanReadBytes(BytesReference.java:283) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.io.stream.StreamInput.readArraySize(StreamInput.java:1026) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.io.stream.StreamInput.readString(StreamInput.java:374) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.io.stream.StreamInput.readStringArray(StreamInput.java:469) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportLogger.format(TransportLogger.java:101) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportLogger.logOutboundMessage(TransportLogger.java:55) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport.internalSendMessage(TcpTransport.java:759) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport.sendRequestToChannel(TcpTransport.java:745) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport.lambda$new$0(TcpTransport.java:174) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportHandshaker.sendHandshake(TransportHandshaker.java:74) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport.executeHandshake(TcpTransport.java:1313) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onResponse(TcpTransport.java:1534) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onResponse(TcpTransport.java:1511) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:96) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:39) ~[elasticsearch-core-6.6.2.jar:6.6.2]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_212]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_212]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212]
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_212]
	at org.elasticsearch.common.concurrent.CompletableContext.complete(CompletableContext.java:61) ~[elasticsearch-core-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$new$1(Netty4TcpChannel.java:65) ~[transport-netty4-client-6.6.2.jar:6.6.2]
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:103) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:306) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:341) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
[2019-04-25T09:22:39,159][TRACE][o.e.t.n.ESLoggingHandler ] [elk01] [id: 0xc91e928f, L:/10.0.0.200:51914 - R:10.0.0.201/10.0.0.201:9300] ACTIVE

Data Node ELK02

[2019-04-25T09:22:38,971][TRACE][o.e.t.n.ESLoggingHandler ] [elk02] [id: 0xf5fb0465, L:/10.0.0.201:59628 - R:10.0.0.200/10.0.0.200:9300] ACTIVE
[2019-04-25T09:22:38,973][TRACE][o.e.t.TransportLogger    ] [elk02] an exception occurred formatting a WRITE trace message
java.io.EOFException: tried to read: 105 bytes but only 27 remaining
	at org.elasticsearch.common.bytes.BytesReferenceStreamInput.ensureCanReadBytes(BytesReferenceStreamInput.java:121) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper.ensureCanReadBytes(BytesReference.java:283) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.io.stream.StreamInput.readArraySize(StreamInput.java:1026) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.io.stream.StreamInput.readString(StreamInput.java:374) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.io.stream.StreamInput.readStringArray(StreamInput.java:469) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportLogger.format(TransportLogger.java:101) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportLogger.logOutboundMessage(TransportLogger.java:55) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport.internalSendMessage(TcpTransport.java:759) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport.sendRequestToChannel(TcpTransport.java:745) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport.lambda$new$0(TcpTransport.java:174) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TransportHandshaker.sendHandshake(TransportHandshaker.java:74) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport.executeHandshake(TcpTransport.java:1313) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onResponse(TcpTransport.java:1534) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onResponse(TcpTransport.java:1511) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:96) ~[elasticsearch-6.6.2.jar:6.6.2]
	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:39) ~[elasticsearch-core-6.6.2.jar:6.6.2]
	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_212]
	at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_212]
	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_212]
	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_212]
	at org.elasticsearch.common.concurrent.CompletableContext.complete(CompletableContext.java:61) ~[elasticsearch-core-6.6.2.jar:6.6.2]
	at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$new$1(Netty4TcpChannel.java:65) ~[transport-netty4-client-6.6.2.jar:6.6.2]
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:103) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:306) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:341) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
[2019-04-25T09:22:38,977][TRACE][o.e.t.n.ESLoggingHandler ] [elk02] [id: 0x23c6ea44, L:/10.0.0.201:59630 - R:10.0.0.200/10.0.0.200:9300] ACTIVE

Yes, sorry, it is quite verbose. Actually, the messages you quoted aren't anything bad, although we should look into where they're coming from. Perhaps logger.org.elasticsearch.transport: TRACE was too much; I think logger.org.elasticsearch.transport.TransportService.tracer: TRACE would be enough. A normal request/response looks like this:

[2019-04-25T14:58:45,824][TRACE][o.e.t.T.tracer           ] [node-1] [10][internal:discovery/zen/join/validate] sent to [{node-0}{q-Sbv3tdT3-wMRt7imJpOQ}{Lb9v-1d1SGSGS4mqpVZMAA}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}] (timeout: [null])
[2019-04-25T14:58:45,830][TRACE][o.e.t.T.tracer           ] [node-0] [10][internal:discovery/zen/join/validate] received request
[2019-04-25T14:58:45,848][TRACE][o.e.t.T.tracer           ] [node-0] [10][internal:discovery/zen/join/validate] sent response
[2019-04-25T14:58:45,852][TRACE][o.e.t.T.tracer           ] [node-1] [10][internal:discovery/zen/join/validate] received response from [{node-0}{q-Sbv3tdT3-wMRt7imJpOQ}{Lb9v-1d1SGSGS4mqpVZMAA}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]

Note that [10] is the request ID and [internal:discovery/zen/join/validate] is the name of the action. I think the "failed to validate incoming join request from node" message you're seeing indicates that one or more of these steps isn't happening.
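(If the broader transport logger was enabled dynamically, it can be swapped for just the tracer the same way; a sketch, again assuming the master's HTTP address from the configs above:)

curl -X PUT 'http://10.0.0.200:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "logger.org.elasticsearch.transport": null,
    "logger.org.elasticsearch.transport.TransportService.tracer": "TRACE"
  }
}'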

Here are the logs, added to Pastebin so you can see the complete log from startup until I got the error on the master.

Master node: https://pastebin.com/embed_js/4ZVr6sF8

Data node: https://pastebin.com/embed_iframe/qzNMDnii

Thanks, that's useful. Many stats requests (cluster:monitor/nodes/stats[n]) are working ok:

[2019-04-25T10:08:46,449][TRACE][o.e.t.T.tracer           ] [elk010] [302][cluster:monitor/nodes/stats[n]] sent to [{elk02}{hxY_wL5KQpSTD3tI-t6duw}{UnjeuuC3TJO9flA1kUm7rA}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled
[2019-04-25T10:08:46,449][TRACE][o.e.t.T.tracer           ] [elk02]  [302][cluster:monitor/nodes/stats[n]] received request
[2019-04-25T10:08:46,452][TRACE][o.e.t.T.tracer           ] [elk02]  [302][cluster:monitor/nodes/stats[n]] sent response
[2019-04-25T10:08:46,452][TRACE][o.e.t.T.tracer           ] [elk010] [302][cluster:monitor/nodes/stats[n]] received response from [{elk02}{hxY_wL5KQpSTD3tI-t6duw}{UnjeuuC3TJO9flA1kUm7rA}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=t

However it seems the info requests (cluster:monitor/nodes/info[n]) are not:

[2019-04-25T10:09:14,938][TRACE][o.e.t.T.tracer           ] [elk010] [332][cluster:monitor/nodes/info[n]] sent to [{elk02}{hxY_wL5KQpSTD3tI-t6duw}{UnjeuuC3TJO9flA1kUm7rA}{10.0.0.201}{10.0.0.201:9300}{ml.machine_memory=405511733248, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=
[2019-04-25T10:09:14,939][TRACE][o.e.t.T.tracer           ] [elk02]  [332][cluster:monitor/nodes/info[n]] received request
[2019-04-25T10:09:14,941][TRACE][o.e.t.T.tracer           ] [elk02]  [332][cluster:monitor/nodes/info[n]] sent response

Note that node elk02 sent the response, but it seems that it never arrives. This pattern seems fairly consistent. Really this looks like something weird happening on your network.

Yeah, I think we have figured something out on this.

All of my nodes are set to MTU 9000 in CentOS, but the switch ports for the physical nodes are set to MTU 2500.

I am adjusting the data node elk02 to MTU 1500 right now to see if that helps. elk01 is a VM with full jumbo-frame support all the way through, so it is staying at 9000 for now.

Will update as soon as I know more.
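(For reference, on CentOS 7 I'm checking and changing the MTU roughly like this; eno1 is just a placeholder for the actual interface name:)

ip link show eno1                                              # prints the current MTU
ip link set dev eno1 mtu 1500                                  # temporary change, lost on reboot
echo 'MTU=1500' >> /etc/sysconfig/network-scripts/ifcfg-eno1   # persists the change across reboots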

So it looks like the MTU fixed my issue. Changing it on the physical host allowed everything to communicate correctly.

Guess my assumption that a 10G link would be set to jumbo frames by default bit me this time...

Thanks for all the help @DavidTurner


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.