Nodes eventually disconnect with SSL enabled on the transport layer

Hi Everyone,

I've got a 3-node, 7.6.1 cluster running on EC2 t2.medium instances in a private VPC that we're using for a machine-learning project. We got everything set up fine and had no issues with a number of different test queries, even running some significant performance and sustained-use tests over several days. There's no excess garbage collection, since the heap never gets over about 5% utilization with the small data sets we're testing on, no excessive CPU or load averages, and the queries return more or less instantly.

Once we enable TLS on the transport layer with X-Pack, however, the cluster starts to fall apart after about an hour of use. There's still no excess GC, no more heap usage than before, and queries return instantly right up until they start to fail, which is when we start to see various node disconnection exceptions in the logs. For example:

[2020-10-14T17:17:14,257][INFO ][o.e.c.s.ClusterApplierService] [ip-10-160-220-194] added {{ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{JbxL1kCLQseIhxR2nwYcZw}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16789581824, ml.max_open_jobs=20, xpack.installed=true}}, term: 3567, version: 41397, reason: ApplyCommitRequest{term=3567, version=41397, sourceNode={ip-10-160-220-58}{5iwQePJlRGGhfspzehIfjw}{1xdjCFuGRVW1lV3LX2D41w}{10.160.220.58}{10.160.220.58:8443}{dilm}{ml.machine_memory=16789577728, ml.max_open_jobs=20, xpack.installed=true}}
[2020-10-14T18:20:01,684][INFO ][o.e.c.c.Coordinator      ] [ip-10-160-220-194] master node [{ip-10-160-220-58}{5iwQePJlRGGhfspzehIfjw}{1xdjCFuGRVW1lV3LX2D41w}{10.160.220.58}{10.160.220.58:8443}{dilm}{ml.machine_memory=16789577728, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{ip-10-160-220-58}{5iwQePJlRGGhfspzehIfjw}{1xdjCFuGRVW1lV3LX2D41w}{10.160.220.58}{10.160.220.58:8443}{dilm}{ml.machine_memory=16789577728, ml.max_open_jobs=20, xpack.installed=true}] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:277) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1019) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-58][10.160.220.58:8443][internal:coordination/fault_detection/leader_check] request_id [11943] timed out after [30032ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) ~[elasticsearch-7.6.1.jar:7.6.1]
        ... 4 more

Restarting all of the nodes gets everything back, but only until the seemingly permanent disconnect crops up again. I've checked the TCP keepalive timeouts, in addition to checking everything I can think of that could be externally killing these connections, and found nothing. Disabling TLS on the transport layer resolves the issue. Any thoughts you might be able to provide would be much appreciated, and I'm happy to answer any clarifying questions.

Thank you

t2-range instances are, in my experience, not a great choice for Elasticsearch (possibly with the exception of dedicated master nodes under reasonably light load), as they can run out of CPU credits and get throttled at the wrong time, which can cause instability. I would recommend trying another instance type, e.g. m5, and seeing if that resolves the issue.

Interesting, I suppose it's possible that we're pushing the limits with the overhead required for TLS between the nodes. I had considered the networking limitations a potential issue, but mostly ruled that out with the successful tests without encryption. I'll try migrating the snapshots from my t2.xlarges (I wrote the wrong type down in my first post) to m5.xlarges and report back.

Unfortunately, that does not seem to have changed the situation. Right about the same time as always, we started seeing the timeouts and query failures again.

Does the cluster completely fall to pieces at this point, or do the nodes drop out and then rejoin? If they don't rejoin, do they continue to report ReceiveTimeoutTransportExceptions? How long have you left it in that state before restarting things to recover your cluster?

Have you set net.ipv4.tcp_retries2 to a more sensible value than the default of 15? What are your TCP keepalive settings exactly?

Are the clocks on the nodes all in sync?

The symptoms you describe are pretty weird, not like anything I've encountered before, but they definitely sound like something wonky with the network to me. I'm not sure why it only happens with TLS enabled, though.

I agree; I've been working with Elasticsearch in various capacities over the last 7 years and have never seen anything like this. To answer your questions: yes, it's a permanent failure, and the connection is never re-established until a restart. We have let it go for several days just to see if it recovers; instead, other nodes disconnect until each node is cut off from the other two and locked in a loop of trying to find enough master-eligible nodes to re-form the cluster. Over time the number of exceptions increases, and we see a growing backlog of tasks on each node trying to sync shard state, which eventually backlogs all tasks entirely.

I set the TCP keepalive settings at the OS level (I forgot to mention this is on Ubuntu 18.04) to the following in sysctl.conf:

net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 6

I have not tried changing tcp_retries2 yet.
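For anyone following along, here's a sketch of how those kernel settings can be checked and tcp_retries2 lowered. The value of 5 is what the Elasticsearch system-configuration docs suggest for noticing dead connections sooner; the drop-in file name is just a convention, adjust for your environment:

```shell
# Show the current keepalive and retransmission settings
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl \
       net.ipv4.tcp_keepalive_probes net.ipv4.tcp_retries2

# Lower tcp_retries2 so dead connections are detected sooner
# (5 is the value suggested in the Elasticsearch docs; default is 15)
sudo sysctl -w net.ipv4.tcp_retries2=5

# Persist the change across reboots (file name is arbitrary)
echo 'net.ipv4.tcp_retries2 = 5' | sudo tee /etc/sysctl.d/90-elasticsearch.conf
```

Note that lowering tcp_retries2 only shortens how long a broken connection lingers; it won't fix whatever is breaking the connections in the first place.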

Interestingly, in digging through syslog previously, we have seen occasional failures of NTP to reach its servers, so there's likely some minor drift, though the clocks on the servers do appear to be within a second or so of UTC whenever we check. That has also led us to investigate network issues, though I haven't gone as far as opening a support ticket with AWS just yet. The TLS-only piece has given me pause each time I think to.

Thanks, ok, that rules out a fair few things.

If you restart just one of the three nodes does it manage to re-form the cluster, or do you have to reboot two or more of them?

Can you share some sample logs from when the cluster has been failing for a while? The block quoted above looks like the start of the problems, but I don't think it'll carry on quite like that.

I'm wondering if the trouble is in TLS's certificate verification, which might be relying on another external system (e.g. DNS) that's not behaving properly. Note particularly that the JVM DNS cache may be in play; make sure you have that set up right. That was my hunch with the clock sync too, but ±1s should be good enough for TLS.
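One quick way to see what DNS caching the JVM will actually do (the java.security path below assumes the Ubuntu OpenJDK 11 package; adjust for your JVM):

```shell
# Elasticsearch normally pins the JVM DNS cache TTLs via system properties;
# confirm they are present on the running process
jps -lvm | grep -o 'es.networkaddress.cache[^ ]*'

# The JVM-level defaults live in java.security
# (path assumes Ubuntu's openjdk-11 package layout)
grep -E '^#?networkaddress\.cache' \
     /usr/lib/jvm/java-11-openjdk-amd64/conf/security/java.security
```

If the es.networkaddress.cache.ttl and es.networkaddress.cache.negative.ttl properties show up on the process, the distribution's defaults are in effect regardless of what java.security says.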

Might be worth analysing a packet capture; that would tell us whether it's failing to establish a connection at all, hanging during the TLS handshake, or establishing the TLS session and then failing at some later step.

Generally we find that restarting only the single disconnected node is insufficient unless the master is also restarted. It's as if neither can let go of the broken pipe.

I've also been thinking that it's something in the TLS verification, though it is interesting that it works for a time. For what it's worth, we're using Elastic's certutil to generate the CA, and we're using PKCS#12 certs generated with the same tool. I have verified with keytool that the certificate authority is present on each machine, baked into the node's p12 file.

I will take a look at the JVM DNS cache, though we're using static IPs for unicast host discovery. We're also using "certificate" verification mode rather than "full", though I regenerated the certs a few days ago with the ip, name, and dns flags for good measure. I can also try to get a packet capture of the initial failures.
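For a manual handshake sanity check against the transport port, something like the following might help (the IP is one of the nodes from the logs; the keystore and CA paths are placeholders for wherever your certs actually live):

```shell
# Attempt a TLS handshake against another node's transport port;
# a hang here, rather than a verify error, would match the symptoms
openssl s_client -connect 10.160.220.58:8443 \
    -CAfile /etc/elasticsearch/certs/ca.crt </dev/null

# Inspect the node's PKCS#12 to confirm the CA chain is embedded
# (path and store password are placeholders)
keytool -list -v -keystore /etc/elasticsearch/certs/node.p12 -storetype PKCS12
```

With "certificate" verification mode, hostnames and IPs in the cert aren't checked, so a clean s_client handshake plus an embedded CA chain would rule out most static misconfiguration.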

In the meantime, here are a few more recent blocks of logs from the current cluster. It was last restarted about 20 hours ago and has been failing for approximately 19 hours.

[2020-10-15T04:07:45,389][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update node information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T04:07:45,389][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [ip-10-160-220-194] failed to execute on node [5iwQePJlRGGhfspzehIfjw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-58][10.160.220.58:8443][cluster:monitor/nodes/stats[n]] request_id [76726] timed out after [30011ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) [elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T04:08:45,390][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [ip-10-160-220-194] failed to execute on node [5iwQePJlRGGhfspzehIfjw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-58][10.160.220.58:8443][cluster:monitor/nodes/stats[n]] request_id [76822] timed out after [30010ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) [elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T04:09:45,392][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update shard information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T04:10:45,394][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update shard information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T04:11:45,396][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update shard information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T01:20:46,725][WARN ][o.e.t.OutboundHandler    ] [ip-10-160-220-194] send message failed [channel: Netty4TcpChannel{localAddress=/10.160.220.194:46932, remoteAddress=10.160.220.128/10.160.220.128:8443}]
javax.net.ssl.SSLException: handshake timed out
        at io.netty.handler.ssl.SslHandler$5.run(SslHandler.java:2011) [netty-handler-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:150) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T01:23:08,773][WARN ][o.e.t.OutboundHandler    ] [ip-10-160-220-194] send message failed [channel: Netty4TcpChannel{localAddress=/10.160.220.194:46996, remoteAddress=10.160.220.128/10.160.220.128:8443}]
javax.net.ssl.SSLException: handshake timed out
        at io.netty.handler.ssl.SslHandler$5.run(SslHandler.java:2011) [netty-handler-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:150) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at java.lang.Thread.run(Thread.java:830) [?:?]

And on the disconnected node:

[2020-10-15T17:03:22,389][INFO ][o.e.c.c.JoinHelper       ] [ip-10-160-220-128] failed to join {ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=3569, lastAcceptedTerm=3568, lastAcceptedVersion=41538, sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, targetNode={ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-194][10.160.220.194:8443][internal:cluster/coordination/join] request_id [116592] timed out after [59818ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) [elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T17:03:26,549][INFO ][o.e.c.c.JoinHelper       ] [ip-10-160-220-128] last failed join attempt was 4.1s ago, failed to join {ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=3569, lastAcceptedTerm=3568, lastAcceptedVersion=41538, sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, targetNode={ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-194][10.160.220.194:8443][internal:cluster/coordination/join] request_id [116592] timed out after [59818ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]

And

[2020-10-15T11:18:32,549][INFO ][o.e.c.c.JoinHelper       ] [ip-10-160-220-128] failed to join {ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=3569, lastAcceptedTerm=3568, lastAcceptedVersion=41538, sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, targetNode={ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.RemoteTransportException: [ip-10-160-220-194][10.160.220.194:8443][internal:cluster/coordination/join]
Caused by: org.elasticsearch.transport.ConnectTransportException: [ip-10-160-220-128][10.160.220.128:8443] connect_timeout[30s]
        at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:995) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]

This means that connectivity between the nodes is OK, but it suggests things are indeed getting stuck at the verification stage.

Ok, I wasn't certain whether that exception could come up from the connection simply failing or being killed after the initial connect, or whether it strictly indicates a failed verification. Googling around hasn't provided much insight on that particular exception, so I'm never quite sure what to focus on and what's a red herring. I'll take a look at the JVM settings you noted and see if I can think of anything else. One idea we've had is to try an external certificate authority just to see what happens, but I'd appreciate any guidance you have on what to look at first.

I finally had a chance to circle back on this. The link you sent wasn't entirely clear on what you'd consider best tuned for this particular purpose, but I can confirm the default values are in place, as shown here:

jps -lvm
12267 jdk.jcmd/sun.tools.jps.Jps -lvm -Dapplication.home=/usr/lib/jvm/java-11-openjdk-amd64 -Xms8m -Djdk.module.main=jdk.jcmd
12107 org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=COMPAT -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/tmp/elasticsearch-7736255287516992118 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -XX:MaxDirectMemorySize=2147483648 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=deb -Des.bundled_jdk=tr

:+1: The default DNS cache settings look to be in place.

That's not to say it's not DNS, and there are other possibilities too (e.g. CRL or OCSP checks). There may be better ways to diagnose hanging TLS handshakes, but personally I'd look at the packet captures.

Alright, I'll give that a shot today and report back.

Well, I'm not entirely sure what to make of this. I have found no alerts in the TLS traffic between nodes. However, while ES is reporting timeouts and failures, I see a clear pattern of two of the nodes (call them 1 and 2) reconnecting and renegotiating TLS every so often. Sometimes the handshake finishes and encrypted application data starts flowing; sometimes only a Client Hello shows up; sometimes there's a Server Hello in response; and sometimes they make it all the way through the cipher change to encrypted traffic. Meanwhile, the connection between nodes 1 and 3 remains constant, with application data flowing freely at all times.

The odd part is that there doesn't seem to be any alert or other indication of a problem anywhere in the output; it just shows frequent new "Client Hello" packets.

It's possible I'm missing something, though. This is what I've got for tshark:

tshark -V -i eth0 -f "port 8443" -Y ssl

Right, so the question is whether those handshakes are triggering other network traffic that isn't getting a timely response, for instance DNS lookups or maybe HTTP(S). You'll need to look further afield than just a single TCP port.
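A broader capture along those lines might look like this (the interface name, file paths, and the SSH exclusion are assumptions for a typical setup):

```shell
# Capture all traffic except our own SSH session, in a ring buffer
# (10 files of 100 MB each) so a long-running capture can't fill the disk
sudo tcpdump -i eth0 -w /tmp/es-node.pcap -C 100 -W 10 'not tcp port 22'

# Afterwards, filter for DNS, plain HTTP (e.g. OCSP/CRL fetches),
# and the transport port to correlate handshake stalls with other lookups
tshark -r /tmp/es-node.pcap0 -Y 'dns || http || tcp.port == 8443'
```

The point of capturing everything is to see whether a stalled Client Hello on port 8443 lines up in time with an unanswered DNS query or HTTP request going elsewhere.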

Well, I managed to capture a few encrypted alerts to something in AWS in the 52.94.XXX.XXX IP range, so I'm following up on potential issues with our network team. In the meantime, I found a few of these recently that might shed some light on the issue:

[2020-10-20T09:30:33,775][WARN ][o.e.t.TcpTransport       ] [ip-10-160-220-58] exception caught on transport layer [Netty4TcpChannel{localAddress=/10.160.220.58:8443, remoteAddress=/10.160.220.128:34606}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)

Sadly not; this is just something that's logged when a TLS connection is closed abruptly, and it doesn't give any information as to why. These messages were demoted to DEBUG level in more recent versions.