Nodes eventually disconnect with SSL enabled on the transport layer

Hi Everyone,

I've got a 3-node, 7.6.1 cluster running on EC2 t2.medium instances in a private VPC that we're using for a machine-learning project. We got everything set up fine and had no issues with a number of different test queries, even running some significant performance and sustained-use tests over several days. There's no excess garbage collection, since the heap never gets over about 5% utilization with the small data sets we're testing on, no excessive CPU or load averages, and the queries return more or less instantly.

Once we enable TLS on the transport layer with X-Pack, however, the cluster starts to fall apart after about an hour of use. There's still no excess GC, no more heap usage than before, and queries return instantly right up until they start to fail, which is when we start to see various node disconnection exceptions in the logs. For example:

[2020-10-14T17:17:14,257][INFO ][o.e.c.s.ClusterApplierService] [ip-10-160-220-194] added {{ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{JbxL1kCLQseIhxR2nwYcZw}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16789581824, ml.max_open_jobs=20, xpack.installed=true}}, term: 3567, version: 41397, reason: ApplyCommitRequest{term=3567, version=41397, sourceNode={ip-10-160-220-58}{5iwQePJlRGGhfspzehIfjw}{1xdjCFuGRVW1lV3LX2D41w}{10.160.220.58}{10.160.220.58:8443}{dilm}{ml.machine_memory=16789577728, ml.max_open_jobs=20, xpack.installed=true}}
[2020-10-14T18:20:01,684][INFO ][o.e.c.c.Coordinator      ] [ip-10-160-220-194] master node [{ip-10-160-220-58}{5iwQePJlRGGhfspzehIfjw}{1xdjCFuGRVW1lV3LX2D41w}{10.160.220.58}{10.160.220.58:8443}{dilm}{ml.machine_memory=16789577728, ml.max_open_jobs=20, xpack.installed=true}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{ip-10-160-220-58}{5iwQePJlRGGhfspzehIfjw}{1xdjCFuGRVW1lV3LX2D41w}{10.160.220.58}{10.160.220.58:8443}{dilm}{ml.machine_memory=16789577728, ml.max_open_jobs=20, xpack.installed=true}] failed [3] consecutive checks
        at org.elasticsearch.cluster.coordination.LeaderChecker$CheckScheduler$1.handleException(LeaderChecker.java:277) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1118) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1019) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-58][10.160.220.58:8443][internal:coordination/fault_detection/leader_check] request_id [11943] timed out after [30032ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) ~[elasticsearch-7.6.1.jar:7.6.1]
        ... 4 more

Restarting all of the nodes gets everything back, but only until the seemingly permanent disconnect crops up again. I've checked the TCP keepalive timeouts, in addition to checking everything I can think of that could be externally killing these connections, and found nothing. Disabling TLS on the transport layer resolves the issue. Any thoughts you might be able to provide would be much appreciated, and I'm happy to answer any clarifying questions.

Thank you

t2-range instances are, in my experience, not a great choice for Elasticsearch (possibly with the exception of dedicated master nodes under reasonably light load), as they can run out of CPU credits and get throttled at the wrong time, which can cause instability. I would recommend trying another instance type, e.g. m5, and seeing if that resolves the issue.

Interesting, I suppose it's possible that we're pushing the limits with the overhead required for TLS between the nodes. I had considered the networking limitations a potential issue, but mostly ruled that out with the successful tests without encryption. I'll try migrating the snapshots from my t2.xlarges (I wrote the wrong type down in my first post) to m5.xlarges and report back.

Unfortunately, that does not seem to have changed the situation. Right about the same time as always, we started seeing the timeouts and query failures again.

Does the cluster completely fall to pieces at this point, or do the nodes drop out and then rejoin? If they don't rejoin, do they continue to report ReceiveTimeoutTransportExceptions? How long have you left it in that state before restarting things to recover your cluster?

Have you set net.ipv4.tcp_retries2 to a more sensible value than the default of 15? What are your TCP keepalive settings exactly?

Are the clocks on the nodes all in sync?

The symptoms you describe are pretty weird, not like anything I've encountered before, but they definitely sound like something wonky with the network to me. I'm not sure why it only happens with TLS enabled, though.

I agree; I've been working with Elasticsearch in various capacities over the last 7 years and have never seen anything like this. To answer your questions: yes, it's a permanent failure, and the connection is never re-established until a restart. We have let it go for several days just to see if it recovers; instead, other nodes disconnect until each node is cut off from the other two and locked in a loop of trying to find enough master-eligible nodes to re-form the cluster. Over time the number of exceptions increases, and we see a growing backlog of tasks on each node trying to sync shard state, which eventually backlogs all tasks entirely.

I set the TCP keepalive settings at the OS level (I forgot to mention this is on Ubuntu 18.04) to the following in sysctl.conf:

net.ipv4.tcp_keepalive_time = 600
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 6

I have not tried changing tcp_retries2 yet.
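For anyone following along, here's a sketch of how those kernel settings can be checked and tcp_retries2 lowered. The value of 5 is what the Elasticsearch system-configuration docs suggest for noticing dead connections sooner; the drop-in file name is just a convention, adjust for your environment:

```shell
# Show the current keepalive and retransmission settings
sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl \
       net.ipv4.tcp_keepalive_probes net.ipv4.tcp_retries2

# Lower tcp_retries2 so dead connections are detected sooner
# (5 is the value suggested in the Elasticsearch docs; default is 15)
sudo sysctl -w net.ipv4.tcp_retries2=5

# Persist the change across reboots (file name is arbitrary)
echo 'net.ipv4.tcp_retries2 = 5' | sudo tee /etc/sysctl.d/90-elasticsearch.conf
```

Note that lowering tcp_retries2 only shortens how long a broken connection lingers; it won't fix whatever is breaking the connections in the first place.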

Interestingly, in digging through syslog previously, we have seen occasional failures of NTP to reach its servers, so there's likely some minor drift, though the clocks on the servers do appear to be within a second or so of UTC whenever we check. That has also led us to investigate network issues, though I haven't gone as far as opening a support ticket with AWS just yet. The TLS-only piece has given me pause each time I think to.

Thanks, ok, that rules out a fair few things.

If you restart just one of the three nodes does it manage to re-form the cluster, or do you have to reboot two or more of them?

Can you share some sample logs from when the cluster has been failing for a while? The block quoted above looks like the start of the problems, but I don't think it'll carry on quite like that.

I'm wondering if the trouble is in TLS's certificate verification, which might be relying on another external system (e.g. DNS) that's not behaving properly. Note particularly that the JVM DNS cache may be in play; make sure you have that set up right. That was my hunch with the clock sync too, but ±1s should be good enough for TLS.
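One quick way to see what DNS caching the JVM will actually do (the java.security path below assumes the Ubuntu OpenJDK 11 package; adjust for your JVM):

```shell
# Elasticsearch normally pins the JVM DNS cache TTLs via system properties;
# confirm they are present on the running process
jps -lvm | grep -o 'es.networkaddress.cache[^ ]*'

# The JVM-level defaults live in java.security
# (path assumes Ubuntu's openjdk-11 package layout)
grep -E '^#?networkaddress\.cache' \
     /usr/lib/jvm/java-11-openjdk-amd64/conf/security/java.security
```

If the es.networkaddress.cache.ttl and es.networkaddress.cache.negative.ttl properties show up on the process, the distribution's defaults are in effect regardless of what java.security says.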

Might be worth analysing a packet capture; that would tell us whether it's failing to establish a connection at all, hanging during the TLS handshake, or establishing the TLS session and then failing at some later step.

Generally we find that restarting only the single disconnected node is insufficient unless the master is also restarted. It's as if neither can let go of the broken pipe.

I've also been thinking that it's something in the TLS verification, though it is interesting that it works for a time. For what it's worth, we're using Elastic's certutil to generate the CA, and we're using PKCS#12 certs generated with the same tool. I have verified with keytool that the certificate authority is present on each machine, baked into the node's p12 file.

I will take a look at the JVM DNS cache, though we're using static IPs for unicast host discovery. We're also using "certificate" verification mode rather than "full", though I regenerated the certs a few days ago with the ip, name, and dns flags for good measure. I can also try to get a packet capture of the initial failures.
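For a manual handshake sanity check against the transport port, something like the following might help (the IP is one of the nodes from the logs; the keystore and CA paths are placeholders for wherever your certs actually live):

```shell
# Attempt a TLS handshake against another node's transport port;
# a hang here, rather than a verify error, would match the symptoms
openssl s_client -connect 10.160.220.58:8443 \
    -CAfile /etc/elasticsearch/certs/ca.crt </dev/null

# Inspect the node's PKCS#12 to confirm the CA chain is embedded
# (path and store password are placeholders)
keytool -list -v -keystore /etc/elasticsearch/certs/node.p12 -storetype PKCS12
```

With "certificate" verification mode, hostnames and IPs in the cert aren't checked, so a clean s_client handshake plus an embedded CA chain would rule out most static misconfiguration.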

In the meantime, here are a few more recent blocks of logs from the current cluster. It was last restarted about 20 hours ago and has been failing for approximately 19 hours.

[2020-10-15T04:07:45,389][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update node information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T04:07:45,389][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [ip-10-160-220-194] failed to execute on node [5iwQePJlRGGhfspzehIfjw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-58][10.160.220.58:8443][cluster:monitor/nodes/stats[n]] request_id [76726] timed out after [30011ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) [elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T04:08:45,390][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [ip-10-160-220-194] failed to execute on node [5iwQePJlRGGhfspzehIfjw]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-58][10.160.220.58:8443][cluster:monitor/nodes/stats[n]] request_id [76822] timed out after [30010ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) [elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T04:09:45,392][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update shard information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T04:10:45,394][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update shard information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T04:11:45,396][WARN ][o.e.c.InternalClusterInfoService] [ip-10-160-220-194] Failed to update shard information for ClusterInfoUpdateJob within 30s timeout
[2020-10-15T01:20:46,725][WARN ][o.e.t.OutboundHandler    ] [ip-10-160-220-194] send message failed [channel: Netty4TcpChannel{localAddress=/10.160.220.194:46932, remoteAddress=10.160.220.128/10.160.220.128:8443}]
javax.net.ssl.SSLException: handshake timed out
        at io.netty.handler.ssl.SslHandler$5.run(SslHandler.java:2011) [netty-handler-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:150) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T01:23:08,773][WARN ][o.e.t.OutboundHandler    ] [ip-10-160-220-194] send message failed [channel: Netty4TcpChannel{localAddress=/10.160.220.194:46996, remoteAddress=10.160.220.128/10.160.220.128:8443}]
javax.net.ssl.SSLException: handshake timed out
        at io.netty.handler.ssl.SslHandler$5.run(SslHandler.java:2011) [netty-handler-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:150) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:510) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:518) [netty-transport-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1050) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.43.Final.jar:4.1.43.Final]
        at java.lang.Thread.run(Thread.java:830) [?:?]

And on the disconnected node:

[2020-10-15T17:03:22,389][INFO ][o.e.c.c.JoinHelper       ] [ip-10-160-220-128] failed to join {ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=3569, lastAcceptedTerm=3568, lastAcceptedVersion=41538, sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, targetNode={ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-194][10.160.220.194:8443][internal:cluster/coordination/join] request_id [116592] timed out after [59818ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) [elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) [elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
[2020-10-15T17:03:26,549][INFO ][o.e.c.c.JoinHelper       ] [ip-10-160-220-128] last failed join attempt was 4.1s ago, failed to join {ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=3569, lastAcceptedTerm=3568, lastAcceptedVersion=41538, sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, targetNode={ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip-10-160-220-194][10.160.220.194:8443][internal:cluster/coordination/join] request_id [116592] timed out after [59818ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1020) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]

And

[2020-10-15T11:18:32,549][INFO ][o.e.c.c.JoinHelper       ] [ip-10-160-220-128] failed to join {ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=3569, lastAcceptedTerm=3568, lastAcceptedVersion=41538, sourceNode={ip-10-160-220-128}{lWvrlAZWSsa-bJYcfthmcw}{A4NwSOAvSEmBEmMxhIyIHA}{10.160.220.128}{10.160.220.128:8443}{dilm}{ml.machine_memory=16305467392, xpack.installed=true, ml.max_open_jobs=20}, targetNode={ip-10-160-220-194}{IWBZLY2HSfaU6uOrGnlAYw}{QrxbhHlkTl-66jGDQbBZgw}{10.160.220.194}{10.160.220.194:8443}{dilm}{ml.machine_memory=16305459200, ml.max_open_jobs=20, xpack.installed=true}}]}
org.elasticsearch.transport.RemoteTransportException: [ip-10-160-220-194][10.160.220.194:8443][internal:cluster/coordination/join]
Caused by: org.elasticsearch.transport.ConnectTransportException: [ip-10-160-220-128][10.160.220.128:8443] connect_timeout[30s]
        at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:995) ~[elasticsearch-7.6.1.jar:7.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.6.1.jar:7.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]

This means that connectivity between the nodes is OK, but it suggests things are indeed getting stuck at the verification stage.

Ok, I wasn't certain whether that exception could come up from the connection simply failing or being killed after the initial connect, or whether it strictly indicates a failed verification. Googling around hasn't provided much insight on that particular exception, so I'm never quite sure what to focus on and what's a red herring. I'll take a look at the JVM settings you noted and see if I can think of anything else. One idea we've had is to try an external certificate authority just to see what happens, but I'd appreciate any guidance you have on what to look at first.

I finally had a chance to circle back on this. The link you sent wasn't entirely clear on what you'd consider best tuned for this particular purpose, but I can confirm the default values are in place, as shown here:

jps -lvm
12267 jdk.jcmd/sun.tools.jps.Jps -lvm -Dapplication.home=/usr/lib/jvm/java-11-openjdk-amd64 -Xms8m -Djdk.module.main=jdk.jcmd
12107 org.elasticsearch.bootstrap.Elasticsearch -p /var/run/elasticsearch/elasticsearch.pid --quiet -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -Dio.netty.noUnsafe=true -Dio.netty.noKeySetOptimization=true -Dio.netty.recycler.maxCapacityPerThread=0 -Dio.netty.allocator.numDirectArenas=0 -Dlog4j.shutdownHookEnabled=false -Dlog4j2.disable.jmx=true -Djava.locale.providers=COMPAT -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.io.tmpdir=/tmp/elasticsearch-7736255287516992118 -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/lib/elasticsearch -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m -XX:MaxDirectMemorySize=2147483648 -Des.path.home=/usr/share/elasticsearch -Des.path.conf=/etc/elasticsearch -Des.distribution.flavor=default -Des.distribution.type=deb -Des.bundled_jdk=tr

:+1: The default DNS cache settings look to be in place.

That's not to say it's not DNS, and there are other possibilities too (e.g. CRL or OCSP checks). There may be better ways to diagnose hanging TLS handshakes, but personally I'd look at the packet captures.

Alright, I'll give that a shot today and report back.

Well, I'm not entirely sure what to make of this. I have found no alerts in the TLS traffic between nodes. However, while ES is reporting timeouts and failures, I see a clear pattern of two of the nodes (call them 1 and 2) reconnecting and renegotiating TLS every so often. Sometimes the handshake finishes and encrypted application data starts flowing; sometimes only a Client Hello shows up; sometimes there's a Server Hello in response; and sometimes they make it all the way through the cipher change to encrypted traffic. Meanwhile, the connection between nodes 1 and 3 remains constant, with application data flowing freely at all times.

The odd part is that there doesn't seem to be any alert or other indication of a problem anywhere in the output; it just shows frequent new "Client Hello" packets.

It's possible I'm missing something, though. This is what I've got for tshark:

tshark -V -i eth0 -f "port 8443" -Y ssl

Right, so the question is whether those handshakes are triggering other network traffic that isn't getting a timely response, for instance DNS lookups or maybe HTTP(S). You'll need to look further afield than just a single TCP port.
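A broader capture along those lines might look like this (the interface name, file paths, and the SSH exclusion are assumptions for a typical setup):

```shell
# Capture all traffic except our own SSH session, in a ring buffer
# (10 files of 100 MB each) so a long-running capture can't fill the disk
sudo tcpdump -i eth0 -w /tmp/es-node.pcap -C 100 -W 10 'not tcp port 22'

# Afterwards, filter for DNS, plain HTTP (e.g. OCSP/CRL fetches),
# and the transport port to correlate handshake stalls with other lookups
tshark -r /tmp/es-node.pcap0 -Y 'dns || http || tcp.port == 8443'
```

The point of capturing everything is to see whether a stalled Client Hello on port 8443 lines up in time with an unanswered DNS query or HTTP request going elsewhere.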

Well, I managed to capture a few encrypted alerts to something in AWS in the 52.94.XXX.XXX IP range, so I'm following up on potential issues with our network team. In the meantime, I found a few of these recently that might shed some light on the issue:

[2020-10-20T09:30:33,775][WARN ][o.e.t.TcpTransport       ] [ip-10-160-220-58] exception caught on transport layer [Netty4TcpChannel{localAddress=/10.160.220.58:8443, remoteAddress=/10.160.220.128:34606}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)

Sadly not; this is just something that's logged when a TLS connection is closed abruptly, and it doesn't give any information as to why. These messages were demoted to DEBUG level in more recent versions.