Java application disconnects from Elasticsearch cluster

We have a 3-node Elasticsearch cluster behind an AWS ELB. Our Java application communicates with the cluster using the Elasticsearch Java client pointed at the ELB.
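
For reference, the wiring looks roughly like the following sketch (the cluster name and ELB hostname are placeholders, and the 1.x-era TransportClient API is assumed here):

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class EsClientFactory {
    public static TransportClient buildClient() {
        // All traffic goes through a single transport address: the ELB DNS name on port 9300
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "my-cluster") // placeholder cluster name
                .build();
        return new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("indexer-elb.example.com", 9300));
    }
}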

We have been noticing that the application intermittently loses access to the cluster. There is no repeatable pattern to the disconnects, nor do the log messages on either end show any meaningful information about why the disconnect happened.

Typical log message in ES client (application) logs:
[ INFO] 2018-02-25 03:53:22,758 org.elasticsearch.client.transport - [Amiko Kobayashi] failed to get node info for [#transport#-1][localhost][inet[indexer.xxx.xxx.xxx.xxx/10.78.5.137:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[indexer.xxx.xxx.xxx.xxx/10.78.5.137:9300]][cluster:monitor/nodes/info] request_id [151663] timed out after [5000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Typical log message in the ES node logs:
[2018-02-28 21:45:14,589][TRACE][transport.netty ] [xxx.xxx.xxx.xxx.xxx.xxx.xxx.xxx] close connection exception caught on transport layer [[id: 0xa52121cb, /10.78.5.120:19418 :> /10.78.5.115:9300]], disconnecting from relevant node
java.nio.channels.ClosedChannelException
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Also worth noting: this issue has happened at idle time (with very little data in the cluster) and also when a significant amount of data was being ingested, so it does not seem to be load related.

Any ideas on which areas to focus investigation would be much appreciated.

If you mean that you are using the TransportClient to connect to the cluster through an AWS ELB, then this is not recommended and not really needed.

The TransportClient creates persistent connections to the nodes, and the AWS ELB will eventually cut idle connections.

Also, an external LB is not really needed in this case, since the TransportClient does its own load balancing across the nodes it connects to.
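
For example, instead of pointing the client at the ELB, you can give it the node addresses directly and optionally let it sniff the rest of the cluster. A rough sketch, assuming the same 1.x-style API and placeholder hostnames:

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class DirectEsClientFactory {
    public static TransportClient buildClient() {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "my-cluster")   // placeholder cluster name
                .put("client.transport.sniff", true) // discover the remaining cluster nodes automatically
                .build();
        // List the nodes themselves; the client round-robins requests across them
        return new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("es-node-1.example.com", 9300))
                .addTransportAddress(new InetSocketTransportAddress("es-node-2.example.com", 9300))
                .addTransportAddress(new InetSocketTransportAddress("es-node-3.example.com", 9300));
    }
}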

Hi,

To follow up on this, we're seeing timeouts from Elasticsearch after 5000ms, which is nowhere near the idle timeout configured on the AWS ELB (1800s).
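
If I'm reading the trace correctly, that 5000ms may come from the client itself rather than the ELB: the TransportClient periodically pings its configured nodes with a nodes-info request, and client.transport.ping_timeout defaults to 5s. A sketch of raising it (placeholder names, example values, and the same 1.x-style API assumed as above):

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class TunedEsClientFactory {
    public static TransportClient buildClient() {
        Settings settings = ImmutableSettings.settingsBuilder()
                .put("cluster.name", "my-cluster")                     // placeholder cluster name
                .put("client.transport.ping_timeout", "30s")           // node ping timeout, default 5s
                .put("client.transport.nodes_sampler_interval", "30s") // how often nodes are re-sampled, default 5s
                .build();
        return new TransportClient(settings)
                .addTransportAddress(new InetSocketTransportAddress("indexer-elb.example.com", 9300));
    }
}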

We're also looking into using the AWS NLB (https://docs.aws.amazon.com/elasticloadbalancing/latest/network/introduction.html), which is a Layer 4 network load balancer. Is there any opinion on this, or is the general consensus that any type of load balancer can cause issues?

Generally speaking, we don't recommend using an L4 LB for transport connections.

Also, keep in mind that the TransportClient has been deprecated, and our recommendation is to switch to the High Level REST Client, in which case you can use an HTTP LB.
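
For reference, a minimal sketch of the High Level REST Client pointed at an HTTP load balancer (6.x-era API; hostname and port are placeholders):

import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class RestEsClientFactory {
    public static RestHighLevelClient buildClient() {
        // HTTP traffic on port 9200 can go through any HTTP-aware load balancer
        return new RestHighLevelClient(
                RestClient.builder(new HttpHost("es-http-lb.example.com", 9200, "http")));
    }
}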

Thanks, that helps a lot!
