Intermittent direct buffer memory error while initializing ES connection

Hi All,

We are running an 8-node Elasticsearch cluster (version 1.5.2) with 120 GB of heap memory in total. We run a daemon every 2 hours to derive some data, and intermittently (once every 6-7 days) we hit the following issue while initializing the connection to Elasticsearch.

Exception in thread "Timer-0" java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:658)
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
        at org.elasticsearch.common.netty.channel.socket.nio.SocketSendBufferPool$Preallocation.<init>(SocketSendBufferPool.java:156)
        at org.elasticsearch.common.netty.channel.socket.nio.SocketSendBufferPool.<init>(SocketSendBufferPool.java:42)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.<init>(AbstractNioWorker.java:45)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.<init>(NioWorker.java:45)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.newWorker(NioWorkerPool.java:44)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.newWorker(NioWorkerPool.java:28)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorkerPool.init(AbstractNioWorkerPool.java:80)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.<init>(NioWorkerPool.java:39)
        at org.elasticsearch.common.netty.channel.socket.nio.NioWorkerPool.<init>(NioWorkerPool.java:33)
        at org.elasticsearch.transport.netty.NettyTransport.createClientBootstrap(NettyTransport.java:298)
        at org.elasticsearch.transport.netty.NettyTransport.doStart(NettyTransport.java:224)
        at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
        at org.elasticsearch.transport.TransportService.doStart(TransportService.java:153)
        at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:85)
        at org.elasticsearch.client.transport.TransportClient.<init>(TransportClient.java:189)

Please let me know if you have any idea what could be behind this error.

It would be a great help if anyone could share suggestions for resolving it.

Are you controlling the size of bulk requests that are being sent to ES? If you're only controlling the number of documents (e.g. 10,000 docs per bulk), the byte size can still vary a lot (10k docs at 10 bytes each is very different from 10k docs at 100 KB each).

Netty allocates direct buffer memory when receiving requests, so if you have a particularly large bulk (or series of bulks arriving at the same time) you can exhaust your direct buffer space.

Not sure if that's the cause in this case, but I've seen similar/related problems in the past so that'd be the first thing to check.
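
If you want to cap bulks by byte size as well as doc count, the Java client's BulkProcessor can enforce both at once. A minimal sketch, assuming an already-built `client` instance; the 1000-doc and 5 MB limits are just placeholder values to tune for your workload:

    import org.elasticsearch.action.bulk.BulkProcessor;
    import org.elasticsearch.action.bulk.BulkRequest;
    import org.elasticsearch.action.bulk.BulkResponse;
    import org.elasticsearch.common.unit.ByteSizeUnit;
    import org.elasticsearch.common.unit.ByteSizeValue;

    // 'client' is assumed to be an existing org.elasticsearch.client.Client.
    // Flush a bulk at 1000 docs OR 5 MB, whichever comes first, so a batch of
    // unusually large docs can't balloon into one huge request.
    BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
        @Override
        public void beforeBulk(long executionId, BulkRequest request) {}

        @Override
        public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {}

        @Override
        public void afterBulk(long executionId, BulkRequest request, Throwable failure) {}
    })
            .setBulkActions(1000)                               // cap by doc count
            .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) // cap by payload size
            .setConcurrentRequests(1)
            .build();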

We are running an 8-node Elasticsearch cluster (version 1.5.2) with 120 GB of heap memory in total.

Also, just checking: do you mean 120 GB for each node, or 120 GB for the entire cluster? What's the server memory capacity, and how much heap are you allocating on each node?

We are sending only 1000 docs per bulk request, and each doc is very small, less than 1 KB, so bulk request size should not be the cause of this error.
I also found the same cause mentioned in other mail threads, so I double-checked our bulk request sizes, but they are not an issue in our case.

The 120 GB of heap memory is for the entire cluster. Total server memory capacity is 240 GB, and we have assigned 50% of it to the ES heap.
There are 8 nodes in total, with 15 GB of heap memory assigned on each node. We have also enabled memlock to avoid memory swapping.

Oh, sorry, I misread your original question. This error is happening in your application code, while initializing a transport client to talk to the cluster, right? Not an exception on the servers?

If that's the case, are you sure you are cleaning up old transport clients? Does the daemon keep a persistent connection to the cluster, or initialize a new one every 2 hours? I'm thinking that perhaps your daemon is allocating a new TC but never de-allocating the old ones, so they sit around eating up overhead until allocation eventually fails.

Similarly, if this is your application code, have you looked for unrelated memory leaks? The Netty error may just be a symptom of something unrelated.
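
For example, something along these lines (hostnames and cluster name are made up): build the client once when the daemon starts and only close it on shutdown, instead of constructing a new one on every run:

    import org.elasticsearch.client.transport.TransportClient;
    import org.elasticsearch.common.settings.ImmutableSettings;
    import org.elasticsearch.common.settings.Settings;
    import org.elasticsearch.common.transport.InetSocketTransportAddress;

    // Built once at daemon startup and reused for every 2-hourly run.
    Settings settings = ImmutableSettings.settingsBuilder()
            .put("cluster.name", "my-cluster")   // placeholder cluster name
            .build();
    TransportClient client = new TransportClient(settings)
            .addTransportAddress(new InetSocketTransportAddress("es-node-1", 9300))
            .addTransportAddress(new InetSocketTransportAddress("es-node-2", 9300));

    // ... each scheduled run reuses 'client' ...

    // Only on daemon shutdown:
    client.close();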

Thanks for the quick response.

Our daemon is allocating a new TC every time; I will have to check the code to see whether the old TCs are being closed or not.
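
If it turns out we do need a fresh client per run, I'm planning to wrap it roughly like this (createTransportClient() and runDerivationJob() are just placeholders for our own code) so the client is always closed:

    import org.elasticsearch.client.transport.TransportClient;

    // Placeholder names for our own factory/job code -- the point is that
    // close() runs even if the job throws, so each 2-hourly run doesn't leak
    // a TransportClient along with its Netty worker pool and direct buffers.
    TransportClient client = createTransportClient();
    try {
        runDerivationJob(client);
    } finally {
        client.close();
    }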