Node shutdown due to OutOfMemoryError: Direct buffer memory

Odd_Erik_Gronberg · October 31, 2019, 7:35am

Hi,

We are running a 5 node cluster with 3 data nodes (where 1 is master eligible) and 2 master (non-data eligible nodes) as windows services.

Each of the data node has 6 GB allocated heap memory and runs on machines with 20 GB RAM available. (from startup log: heap size [5.8gb], compressed ordinary object pointers [true])

After a long period of stability we decided a couple of days ago to enable HTTP compression from the client we use for requesting ES data (NEST .net client). Later the same night we experienced unexpected shutdowns of the data nodes. This has happened several times since then. In the logs I can find that the shutdown is caused by a OOM (OutOfMemoryError: Direct buffer memory).

The monitoring graphs does not indicate that we are using all the allocated heap memory when the exception occurs:

Any help is highly appreciated!

Odd_Erik_Gronberg · October 31, 2019, 7:37am

Log:
[2019-10-31T00:10:37,419][WARN ][o.e.m.j.JvmGcMonitorService] [krs02es14] [gc][35016] overhead, spent [1.4s] collecting in the last [2.1s]
[2019-10-31T00:10:37,938][ERROR][o.e.ExceptionsHelper     ] [krs02es14] fatal error
	at org.elasticsearch.ExceptionsHelper.lambda$maybeDieOnAnotherThread$4(ExceptionsHelper.java:300)
	at java.util.Optional.ifPresent(Optional.java:159)
	at org.elasticsearch.ExceptionsHelper.maybeDieOnAnotherThread(ExceptionsHelper.java:290)
	at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addPromise$1(Netty4TcpChannel.java:88)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:476)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:114)
	at io.netty.util.concurrent.PromiseCombiner.tryPromise(PromiseCombiner.java:170)
	at io.netty.util.concurrent.PromiseCombiner.access$600(PromiseCombiner.java:35)
	at io.netty.util.concurrent.PromiseCombiner$1.operationComplete0(PromiseCombiner.java:62)
	at io.netty.util.concurrent.PromiseCombiner$1.operationComplete(PromiseCombiner.java:44)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:502)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:476)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:415)
	at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:540)
	at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:533)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:114)
	at io.netty.util.internal.PromiseNotificationUtil.tryFailure(PromiseNotificationUtil.java:64)
	at io.netty.channel.ChannelOutboundBuffer.safeFail(ChannelOutboundBuffer.java:721)
	at io.netty.channel.ChannelOutboundBuffer.remove0(ChannelOutboundBuffer.java:306)
	at io.netty.channel.ChannelOutboundBuffer.failFlushed(ChannelOutboundBuffer.java:658)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.closeOutboundBufferForShutdown(AbstractChannel.java:686)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.shutdownOutput(AbstractChannel.java:679)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:954)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:360)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:906)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1370)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:749)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:727)
	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:749)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:741)
	at io.netty.channel.AbstractChannelHandlerContext.access$2100(AbstractChannelHandlerContext.java:56)
	at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1150)
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1073)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.lang.Thread.run(Thread.java:748)

Odd_Erik_Gronberg · October 31, 2019, 7:38am

[2019-10-31T00:10:37,938][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [krs02es14] fatal error in thread [Thread-14368], exiting
java.lang.OutOfMemoryError: Direct buffer memory
	at java.nio.Bits.reserveMemory(Bits.java:693) ~[?:1.8.0_131]
	at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) ~[?:1.8.0_131]
	at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) ~[?:1.8.0_131]
	at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:241) ~[?:?]
	at sun.nio.ch.IOUtil.write(IOUtil.java:58) ~[?:?]
	at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[?:?]
	at io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:405) ~[?:?]
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:939) ~[?:?]
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.flush0(AbstractNioChannel.java:360) ~[?:?]
	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:906) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1370) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:749) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:741) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:727) ~[?:?]
	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:127) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:749) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:741) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.access$2100(AbstractChannelHandlerContext.java:56) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1150) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1073) ~[?:?]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:405) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]

Christian_Dahlqvist · October 31, 2019, 7:39am

Heap should never be more than 50% of available RAM as Elasticsearch need a good amount of off-heap memory storage and page cache to operate efficiently. You should also make sure you have at least 3 master eligible nodes in the cluster.

Odd_Erik_Gronberg · October 31, 2019, 8:46am

Thanks! Yes, total memory available on server is 20 GB so we should be safely within the limit here. Also I forgot to mention that one of the data node is master eligible. So there should be 3 master eligible in total. Is it configurable how much off-heap memory that can be used by ES?

Christian_Dahlqvist · October 31, 2019, 9:26am

What about the 6GB you said was allocated to Elasticsearch?

Odd_Erik_Gronberg · October 31, 2019, 9:45am

Yes, allocated heap size is 6 GB and total server memory is 20 GB.

Armin_Braun · October 31, 2019, 10:09am

Hi @Odd_Erik_Gronberg,

sorry if I missed it somewhere in the above but could you add which ES version you are running so we can better help you please? There have been some significant changes across the direct memory use behaviour across versions in recent times and the investigation here depends on the version quite a bit.

Thanks!

Odd_Erik_Gronberg · October 31, 2019, 10:41am

Sure @Armin_Braun , We are running on version 7.3.0

Armin_Braun · October 31, 2019, 11:15am

This looks somewhat suspicious relative to the graph of heap usage you posted. Do you have swapping enabled maybe? If so I would suggest turning it off as explained here: Disable swapping | Elasticsearch Guide [7.3] | Elastic

That would explain the long GC despite not hitting any memory limits.

Yes it is. You can add an entry -XX:MaxDirectMemorySize= to your jvm.options to configure a size limit here. In your case the limit should be 3g by default which in theory should be plenty. You can try increasing it to see if it fixes the situation though as that would provide a valuable datapoint for the investigation if this isn't related to swapping.

Odd_Erik_Gronberg · October 31, 2019, 11:51am

I have checked the elasticsearch.yml file and we have the following setting present : bootstrap.memory_lock: true so I don't think that is the case.

Odd_Erik_Gronberg · October 31, 2019, 11:55am

So I checked the log for JVM arguments from node startup:

[2019-10-31T10:12:33,137][INFO ][o.e.n.Node] [krs02es14]
JVM arguments [
-Xms1g,
-Xmx1g,
-XX:+UseConcMarkSweepGC,
-XX:CMSInitiatingOccupancyFraction=75,
-XX:+UseCMSInitiatingOccupancyOnly,
-Des.networkaddress.cache.ttl=60,
-Des.networkaddress.cache.negative.ttl=10,
-XX:+AlwaysPreTouch,
-Xss1m,
-Djava.awt.headless=true,
-Dfile.encoding=UTF-8,
-Djna.nosys=true,
-XX:-OmitStackTraceInFastThrow,
-Dio.netty.noUnsafe=true,
-Dio.netty.noKeySetOptimization=true,
-Dio.netty.recycler.maxCapacityPerThread=0,
-Dlog4j.shutdownHookEnabled=false,
-Dlog4j2.disable.jmx=true,
-Djava.io.tmpdir=C:\Users\adminodd\AppData\Local\Temp\elasticsearch,
-XX:+HeapDumpOnOutOfMemoryError,
-XX:HeapDumpPath=data,
-XX:ErrorFile=logs/hs_err_pid%p.log,
-XX:+PrintGCDetails,
-XX:+PrintGCDateStamps,
-XX:+PrintTenuringDistribution,
-XX:+PrintGCApplicationStoppedTime,
-Xloggc:logs/gc.log,
-XX:+UseGCLogFileRotation,
-XX:NumberOfGCLogFiles=32,
-XX:GCLogFileSize=64m,
-Dio.netty.allocator.type=unpooled,
-XX:MaxDirectMemorySize=536870912,
-Delasticsearch,
-Des.path.home=C:\Elasticsearch\elasticsearch-7.3.0,
-Des.path.conf=C:\Elasticsearch\elasticsearch-7.3.0\config,
-Des.distribution.flavor=default,
-Des.distribution.type=zip,
-Des.bundled_jdk=true, exit, abort,
-Xms6072m,
-Xmx6072m,
-Xss1024k]

Does this mean that the MaxDirectMemorySize is under allocated and should be changed to 3 GB? How is this calculated/set? I see that the Xms and Xmx is listed twice as well (first with 1GB and then with 6GB) . Could it be that MaxDirectMemorySize is beeing calculated based on the first values here?

Thanks for your help !

Armin_Braun · October 31, 2019, 12:33pm

@Odd_Erik_Gronberg

yea it looks like it's just 512M indeed (that would be the auto-calculated value based on 1G heap).

=> Yes, I'd try setting it higher to 2 or 3 GB. 512M might be a little low indeed.

Odd_Erik_Gronberg · October 31, 2019, 12:46pm

Great! I'll try to change that setting on all data nodes and let it run for a while to see if that does the trick!

Thanks!

DavidTurner · October 31, 2019, 12:57pm

These settings are duplicated. I wonder if this is the source of the issue. Rather than adjusting MaxDirectMemorySize directly could you address this duplication?

Odd_Erik_Gronberg · November 1, 2019, 7:21am

Hi @DavidTurner. It seems like this is the result of the default Java Options used when installing the windows service initially. Those settings are left in Java Options regardless of what tweaks I do on the Initial/Maximum memory pool setting (see screenshot). I guess the order has something to do here. As Xms6072m/Xmx6072m are passed in latest it seems like these are the ones used. I will remove the duplicates so it doesn't make any more confusion. But increasing the MaxDirectMemorySize to 3g seemed to do the trick. No shutdowns since I changed the configuration.

1_Files_store_

Thanks!

Odd_Erik_Gronberg · November 1, 2019, 7:25am

@Armin_Braun Increasing the MaxDirectMemorySize to 3g seemed to do the trick. No shutdowns since I changed the configuration yesterday. Thanks for you help. It’s highly appreciated!

DavidTurner · November 1, 2019, 8:11am

Thanks for the extra information @Odd_Erik_Gronberg. I opened https://github.com/elastic/windows-installers/issues/344 to ask for further investigation on our side.

system · November 29, 2019, 8:11am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Direct buffer memory problem on master Discovery Elasticsearch	8	925	July 6, 2017
Elasticsearch (6.4.1) - JVM OutOfMemoryError Elasticsearch	5	1008	June 26, 2019
Out of direct memory on indexing Elasticsearch	6	1799	July 6, 2017
Direct buffer memory error intermittently while initializing ES Connection Elasticsearch	6	2235	July 5, 2017
Getting OOME's (Out of Memory Exceptions) to stop Elasticsearch	3	1742	July 6, 2017

Node shutdown due to OutOfMemoryError: Direct buffer memory

Related topics