Generally we find that restarting only the disconnected node is insufficient; the master has to be restarted as well before the node will rejoin. It's as if neither side can let go of the broken pipe.
I've also been wondering whether it's something in the TLS verification, though it is interesting that it works for a time. For what it's worth, we're using Elastic's certutil to generate the CA, and the nodes use p12 certs generated with the same tool. I have verified with keytool that the certificate authority is present on each machine, baked into each node's p12 cert.
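For reference, the generation and verification steps look roughly like this (filenames here are certutil's defaults, not necessarily the exact ones we used):

# generate the CA, then a node cert signed by it (PKCS#12 output)
bin/elasticsearch-certutil ca
bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12

# confirm the CA is bundled into the node keystore on each machine;
# it should show up as a trustedCertEntry alongside the node's PrivateKeyEntry
keytool -list -keystore elastic-certificates.p12 -storetype PKCS12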
I will take a look at the JVM DNS cache, though we're using IPs for unicast host discovery with the static node IPs. We're also using "certificate" verification rather than "full", though I regenerated the certs a few days ago with the --ip, --name, and --dns flags for good measure. I can also try to get a packet capture of the initial failures.
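For context, the relevant pieces of the setup look roughly like this; the IPs match the logs below, but the yml excerpt and commands are paraphrased from memory rather than copied from our configs:

# elasticsearch.yml (per node, paraphrased)
transport.port: 8443
discovery.seed_hosts: ["10.160.220.58:8443", "10.160.220.128:8443", "10.160.220.194:8443"]
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/node.p12
xpack.security.transport.ssl.truststore.path: certs/node.p12

# how the certs were regenerated with the extra SANs, e.g. for the master:
bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12 --name ip-10-160-220-194 --dns ip-10-160-220-194 --ip 10.160.220.194

# the capture I plan to run on the disconnected node around a failure window:
tcpdump -i any -w transport-8443.pcap host 10.160.220.194 and port 8443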
In the meantime, here are a few recent blocks of logs from the current cluster. It was last restarted about 20 hours ago and has been failing for roughly 19 hours. First, on the current master (ip-10-160-220-194):
[2020-10-15T04:07:45,389][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update node information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T04:07:45,389][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [ip-10-160-220-194] failed to execute on node [5iwQePJlRGGhfspzehIfjw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-58][10.160.220.58:8443][cluster:monitor/nodes/stats[n]] request_id [76726] timed out after [30011ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) [elasticsearch-7.6.1.jar:7.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T04:08:45,390][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [ip-10-160-220-194] failed to execute on node [5iwQePJlRGGhfspzehIfjw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-58][10.160.220.58:8443][cluster:monitor/nodes/stats[n]] request_id [76822] timed out after [30010ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) [elasticsearch-7.6.1.jar:7.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T04:09:45,392][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update shard information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T04:10:45,394][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update shard information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T04:11:45,396][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update shard information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T01:20:46,725][WARN ][o.e.t.OutboundHandler ] [ip-10-160-220-194] send message failed [channel: Netty4TcpChannel{localAddress=/10.160.220.194:46932, remoteAddress=10.160.220.128/10.160.220.128:8443}]
javax.net.ssl.SSLException: handshake timed out
at io.netty.handler.ssl.SslHandler$5.run(SslHandler.java:2011) [netty-handler-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:150) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T01:23:08,773][WARN ][o.e.t.OutboundHandler ] [ip-10-160-220-194] send message failed [channel: Netty4TcpChannel{localAddress=/10.160.220.194:46996, remoteAddress=10.160.220.128/10.160.220.128:8443}]
javax.net.ssl.SSLException: handshake timed out
at io.netty.handler.ssl.SslHandler$5.run(SslHandler.java:2011) [netty-handler-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:150) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.43.Final.jar:4.1.43.Final]
at java.lang.Thread.run(Thread.java:830) [?:?]
And on the disconnected node (ip-10-160-220-128):
[2020-10-15T17:03:22,389][INFO ][o.e.c.c.JoinHelper ] [ip-10-160-220-128] failed to join {ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=3569, lastAcceptedTerm=3568, lastAcceptedVersion=41538, sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, targetNode={ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-194][10.160.220.194:8443][internal:cluster/coordination/join] request_id [116592] timed out after [59818ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) [elasticsearch-7.6.1.jar:7.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T17:03:26,549][INFO ][o.e.c.c.JoinHelper ] [ip-10-160-220-128] last failed join attempt was 4.1s ago, failed to join {ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=3569, lastAcceptedTerm=3568, lastAcceptedVersion=41538, sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, targetNode={ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-194][10.160.220.194:8443][internal:cluster/coordination/join] request_id [116592] timed out after [59818ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) ~[elasticsearch-7.6.1.jar:7.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.6.1.jar:7.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]
And another join failure, also from the disconnected node:
[2020-10-15T11:18:32,549][INFO ][o.e.c.c.JoinHelper ] [ip-10-160-220-128] failed to join {ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=3569, lastAcceptedTerm=3568, lastAcceptedVersion=41538, sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, targetNode={ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.RemoteTransportException: [ip-10-160-220-194][10.160.220.194:8443][internal:cluster/coordination/join]
Caused by: org.elasticsearch.transport.ConnectTransportException: [ip-10-160-220-128][10.160.220.128:8443] connect_timeout[30s]
at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:995) ~[elasticsearch-7.6.1.jar:7.6.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.6.1.jar:7.6.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
at java.lang.Thread.run(Thread.java:830) [?:?]