We have a 3-node Elasticsearch cluster behind an AWS ELB. Our Java application communicates with the cluster using the Elasticsearch Java client, pointing at the ELB.
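For context, the client is wired up roughly like this (ES 1.x-style TransportClient; the cluster name and ELB hostname below are placeholders, and our production settings may differ slightly):

import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Build client settings; "our-cluster" is a placeholder for the real cluster name.
Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "our-cluster")
        .build();

// Point the transport client at the ELB's DNS name on the transport port.
Client client = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("elb.internal.example.com", 9300));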
We have been noticing that the application intermittently loses access to the cluster. There is no repeatable pattern to the disconnects, nor do the log messages on either end show any meaningful information about why a disconnect happened.
Typical log message in the ES client (application) logs:
[ INFO] 2018-02-25 03:53:22,758 org.elasticsearch.client.transport - [Amiko Kobayashi] failed to get node info for [#transport#-1][localhost][inet[indexer.xxx.xxx.xxx.xxx/10.78.5.137:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[indexer.xxx.xxx.xxx.xxx/10.78.5.137:9300]][cluster:monitor/nodes/info] request_id [151663] timed out after [5000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
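The [5000ms] above matches the transport client's default ping timeout (client.transport.ping_timeout, 5s by default). Raising it is one knob we are aware of, along these lines (the 30s values are guesses we have not validated, and this would presumably only mask the disconnects rather than explain them):

// Longer ping timeout and node-sampler interval for the transport client.
Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "our-cluster")                     // placeholder, as above
        .put("client.transport.ping_timeout", "30s")            // default 5s
        .put("client.transport.nodes_sampler_interval", "30s")  // default 5s
        .build();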
Typical log message in the ES node logs:
[2018-02-28 21:45:14,589][TRACE][transport.netty ] [xxx.xxx.xxx.xxx.xxx.xxx.xxx.xxx] close connection exception caught on transport layer [[id: 0xa52121cb, /10.78.5.120:19418 :> /10.78.5.115:9300]], disconnecting from relevant node
java.nio.channels.ClosedChannelException
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:433)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It is also worth noting that this issue has happened at idle times (with very little data in the cluster) as well as while a significant amount of data was being ingested, so it does not appear to be load-related.
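As one diagnostic we are considering bypassing the ELB and registering the three node addresses with the client directly (the hostnames below are placeholders), to see whether the disconnects follow the load balancer:

// Same Settings as in the first snippet; es-node-1..3 are placeholders
// for the actual node addresses.
Client client = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("es-node-1", 9300))
        .addTransportAddress(new InetSocketTransportAddress("es-node-2", 9300))
        .addTransportAddress(new InetSocketTransportAddress("es-node-3", 9300));

If the problem goes away in that configuration, we would look more closely at how the ELB treats the client's long-lived transport connections (for example, its idle timeout).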
Any ideas on where to focus the investigation would be much appreciated.