Hi,
I'm using Elasticsearch 2.3.3 with Oracle Java 1.8.0_91, and after a few weeks without a restart, my cluster stops working because of a direct buffer out-of-memory error. In the meantime, I can see the direct buffer pool (which I guess is the JVM direct buffer) slowly increasing up to around 16 GB. My guess is that as soon as the direct buffer pool reaches its maximum value, the JVM starts complaining about direct buffer out of memory and Elasticsearch stops working. My question is: how can such a large usage of the direct buffer pool be explained, with a curve that never decreases?
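For reference, the figures above come from monitoring; the same pool can also be sampled from inside a JVM with the standard BufferPoolMXBean (a minimal sketch, the class name is just for illustration), and the "direct" pool it reports is the one whose curve keeps growing:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class DirectBufferStats {
    public static void main(String[] args) {
        // The platform exposes one BufferPoolMXBean per pool ("direct" and "mapped").
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            System.out.printf("%s: count=%d, used=%d bytes, capacity=%d bytes%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}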
Here is a stack trace example of the out of memory:
[2016-10-06 07:02:29,483][WARN ][http.netty ] [server-01] Caught exception while handling client http traffic, closing connection [id: 0x80666f01, /10.0.0.1:58323 => /10.0.0.2:9200]
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:693)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:174)
at sun.nio.ch.IOUtil.write(IOUtil.java:58)
at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
at org.jboss.netty.channel.socket.nio.SocketSendBufferPool$UnpooledSendBuffer.transferTo(SocketSendBufferPool.java:203)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.write0(AbstractNioWorker.java:201)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.writeFromTaskLoop(AbstractNioWorker.java:151)
at org.jboss.netty.channel.socket.nio.AbstractNioChannel$WriteTask.run(AbstractNioChannel.java:315)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:391)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:315)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
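For what it's worth, the same error is easy to reproduce in isolation by exhausting the direct memory limit (a minimal sketch; the class name and the 64m limit are just illustrative):

import java.nio.ByteBuffer;

public class DirectOomDemo {
    public static void main(String[] args) {
        // Run with: java -XX:MaxDirectMemorySize=64m DirectOomDemo
        // It fails in java.nio.Bits.reserveMemory, the same frame as at the top of the trace above.
        ByteBuffer[] keep = new ByteBuffer[1024];
        for (int i = 0; i < keep.length; i++) {
            // Each direct allocation is counted against -XX:MaxDirectMemorySize.
            keep[i] = ByteBuffer.allocateDirect(16 * 1024 * 1024); // 16 MB each
        }
        System.out.println("allocated " + keep.length + " buffers");
    }
}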
About the cluster: it's running on 7 nodes with 64 GB of memory each and a 16 GB heap. It handles quite a heavy indexing load (several thousand requests per second).
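In case it is relevant, here is a quick way to check which direct-memory limit the JVM is actually running with (a minimal sketch; as far as I understand, when -XX:MaxDirectMemorySize is not passed explicitly, HotSpot defaults the cap to roughly the max heap size):

import java.lang.management.ManagementFactory;

public class DirectLimitCheck {
    public static void main(String[] args) {
        // Print the JVM arguments: look for an explicit -XX:MaxDirectMemorySize=...
        ManagementFactory.getRuntimeMXBean().getInputArguments()
                .forEach(System.out::println);
        // Max heap, for comparison with the observed direct buffer plateau.
        System.out.println("max heap = " + Runtime.getRuntime().maxMemory() + " bytes");
    }
}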
Tell me if more information is needed to troubleshoot this problem.
Any help on the subject will be appreciated. Thanks in advance.