Node shutdown due to OutOfMemoryError: Direct buffer memory

Hi,

We are running a 5 node cluster with 3 data nodes (where 1 is master eligible) and 2 master (non-data eligible nodes) as windows services.

Each of the data node has 6 GB allocated heap memory and runs on machines with 20 GB RAM available. (from startup log: heap size [5.8gb], compressed ordinary object pointers [true])

After a long period of stability we decided a couple of days ago to enable HTTP compression from the client we use for requesting ES data (NEST .net client). Later the same night we experienced unexpected shutdowns of the data nodes. This has happened several times since then. In the logs I can find that the shutdown is caused by a OOM (OutOfMemoryError: Direct buffer memory).

The monitoring graphs does not indicate that we are using all the allocated heap memory when the exception occurs:

Any help is highly appreciated!

Log:
[2019-10-31T00:10:37,419][WARN ][o.e.m.j.JvmGcMonitorService] [krs02es14] [gc][35016] overhead, spent [1.4s] collecting in the last [2.1s]
[2019-10-31T00:10:37,938][ERROR][o.e.ExceptionsHelper     ] [krs02es14] fatal error
	at org.elasticsearch.ExceptionsHelper.lambda$maybeDieOnAnotherThread$4(ExceptionsHelper.java:300)
	at java.util.Optional.ifPresent(Optional.java:159)
	at org.elasticsearch.ExceptionsHelper.maybeDieOnAnotherThread(ExceptionsHelper.java:290)
	at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addPromise$1(Netty4TcpChannel.java:88)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:476)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:114)
	at io.netty.util.concurrent.PromiseCombiner.tryPromise(PromiseCombiner.java:170)
	at io.netty.util.concurrent.PromiseCombiner.access$600(PromiseCombiner.java:35)
	at io.netty.util.concurrent.PromiseCombiner$1.operationComplete0(PromiseCombiner.java:62)
	at io.netty.util.concurrent.PromiseCombiner$1.operationComplete(PromiseCombiner.java:44)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:476)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:114)
	at io.netty.util.internal.PromiseNotificationUtil.tryFailure(PromiseNotificationUtil.java:64)
	at io.netty.channel.ChannelOutboundBuffer.safeFail(ChannelOutboundBuffer.java:721)
	at io.netty.channel.ChannelOutboundBuffer.remove0(ChannelOutboundBuffer.java:306)
	at io.netty.channel.ChannelOutboundBuffer.failFlushed(ChannelOutboundBuffer.java:658)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.closeOutboundBufferForShutdown(AbstractChannel.java:686)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.shutdownOutput(AbstractChannel.java:679)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:954)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:360)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:906)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1370)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:749)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:727)
	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:749)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.AbstractChannelHandlerContext.access$2100(AbstractChannelHandlerContext.java:56)
	at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1150)
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1073)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run(Thread.java:748)
[2019-10-31T00:10:37,938][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [krs02es14] fatal error in thread [Thread-14368], exiting
java.lang.OutOfMemoryError: Direct buffer memory
	at java.nio.Bits.reserveMemory(Bits.java:693) ~[?:1.8.0_131]
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[?:1.8.0_131]
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[?:1.8.0_131]
	at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:241) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:58) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:405) ~[?:?]
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:939) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:360) ~[?:?]
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:906) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1370) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:749) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:741) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:727) ~[?:?]
	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:749) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:741) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.access$2100(AbstractChannelHandlerContext.java:56) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1150) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1073) ~[?:?]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]

Heap should never be more than 50% of available RAM as Elasticsearch need a good amount of off-heap memory storage and page cache to operate efficiently. You should also make sure you have at least 3 master eligible nodes in the cluster.

1 Like

Thanks! Yes, total memory available on server is 20 GB so we should be safely within the limit here. Also I forgot to mention that one of the data node is master eligible. So there should be 3 master eligible in total. Is it configurable how much off-heap memory that can be used by ES?

What about the 6GB you said was allocated to Elasticsearch?

Yes, allocated heap size is 6 GB and total server memory is 20 GB.

Hi @Odd_Erik_Gronberg,

sorry if I missed it somewhere in the above but could you add which ES version you are running so we can better help you please? There have been some significant changes across the direct memory use behaviour across versions in recent times and the investigation here depends on the version quite a bit.

Thanks!

Sure @Armin_Braun , We are running on version 7.3.0

This looks somewhat suspicious relative to the graph of heap usage you posted. Do you have swapping enabled maybe? If so I would suggest turning it off as explained here: https://www.elastic.co/guide/en/elasticsearch/reference/7.3/setup-configuration-memory.html

That would explain the long GC despite not hitting any memory limits.

Yes it is. You can add an entry -XX:MaxDirectMemorySize= to your jvm.options to configure a size limit here. In your case the limit should be 3g by default which in theory should be plenty. You can try increasing it to see if it fixes the situation though as that would provide a valuable datapoint for the investigation if this isn't related to swapping.

I have checked the elasticsearch.yml file and we have the following setting present : bootstrap.memory_lock: true so I don't think that is the case.

So I checked the log for JVM arguments from node startup:

[2019-10-31T10:12:33,137][INFO ][o.e.n.Node] [krs02es14]
JVM arguments [
-Xms1g,
-Xmx1g,
-XX:+UseConcMarkSweepGC,
-XX:CMSInitiatingOccupancyFraction=75,
-XX:+UseCMSInitiatingOccupancyOnly,
-Des.networkaddress.cache.ttl=60,
-Des.networkaddress.cache.negative.ttl=10,
-XX:+AlwaysPreTouch,
-Xss1m,
-Djava.awt.headless=true,
-Dfile.encoding=UTF-8,
-Djna.nosys=true,
-XX:-OmitStackTraceInFastThrow,
-Dio.netty.noUnsafe=true,
-Dio.netty.noKeySetOptimization=true,
-Dio.netty.recycler.maxCapacityPerThread=0,
-Dlog4j.shutdownHookEnabled=false,
-Dlog4j2.disable.jmx=true,
-Djava.io.tmpdir=C:\Users\adminodd\AppData\Local\Temp\elasticsearch,
-XX:+HeapDumpOnOutOfMemoryError,
-XX:HeapDumpPath=data,
-XX:ErrorFile=logs/hs_err_pid%p.log,
-XX:+PrintGCDetails,
-XX:+PrintGCDateStamps,
-XX:+PrintTenuringDistribution,
-XX:+PrintGCApplicationStoppedTime,
-Xloggc:logs/gc.log,
-XX:+UseGCLogFileRotation,
-XX:NumberOfGCLogFiles=32,
-XX:GCLogFileSize=64m,
-Dio.netty.allocator.type=unpooled,
-XX:MaxDirectMemorySize=536870912,
-Delasticsearch,
-Des.path.home=C:\Elasticsearch\elasticsearch-7.3.0,
-Des.path.conf=C:\Elasticsearch\elasticsearch-7.3.0\config,
-Des.distribution.flavor=default,
-Des.distribution.type=zip,
-Des.bundled_jdk=true, exit, abort,
-Xms6072m,
-Xmx6072m,
-Xss1024k]

Does this mean that the MaxDirectMemorySize is under allocated and should be changed to 3 GB? How is this calculated/set? I see that the Xms and Xmx is listed twice as well (first with 1GB and then with 6GB) . Could it be that MaxDirectMemorySize is beeing calculated based on the first values here?

Thanks for your help !

@Odd_Erik_Gronberg

yea it looks like it's just 512M indeed (that would be the auto-calculated value based on 1G heap).

=> Yes, I'd try setting it higher to 2 or 3 GB. 512M might be a little low indeed.

Great! I'll try to change that setting on all data nodes and let it run for a while to see if that does the trick!

Thanks!

These settings are duplicated. I wonder if this is the source of the issue. Rather than adjusting MaxDirectMemorySize directly could you address this duplication?

1 Like

Hi @DavidTurner. It seems like this is the result of the default Java Options used when installing the windows service initially. Those settings are left in Java Options regardless of what tweaks I do on the Initial/Maximum memory pool setting (see screenshot). I guess the order has something to do here. As Xms6072m/Xmx6072m are passed in latest it seems like these are the ones used. I will remove the duplicates so it doesn't make any more confusion. But increasing the MaxDirectMemorySize to 3g seemed to do the trick. No shutdowns since I changed the configuration.

1_Files_store_

Thanks!

@Armin_Braun Increasing the MaxDirectMemorySize to 3g seemed to do the trick. No shutdowns since I changed the configuration yesterday. Thanks for you help. It’s highly appreciated!

Thanks for the extra information @Odd_Erik_Gronberg. I opened https://github.com/elastic/windows-installers/issues/344 to ask for further investigation on our side.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.