Problem
Here is the problem we observed: during bulk indexing, after a while, the semaphore is acquired in the internalAdd method of the asynchronous bulk request handler but is never released.
After some investigation, it appears that an OutOfMemoryError occurred while allocating a new DirectByteBuffer (that memory is never reclaimed because our application's JVM is started with the -XX:+DisableExplicitGC flag).
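To illustrate the deadlock, here is a simplified sketch of the permit life-cycle, assuming the pattern described above (this is not the actual org.elasticsearch.action.bulk.BulkRequestHandler source): the permit is released only from the ActionListener callbacks, so if the asynchronous send dies on the transport thread before either callback runs, the permit is lost and every subsequent add or flush blocks on the semaphore.

```java
import java.util.concurrent.Semaphore;
import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;

// Simplified illustration only, not the real BulkRequestHandler code.
class PermitLeakSketch {
    private final Semaphore semaphore = new Semaphore(1); // concurrentRequests = 1

    void execute(Client client, BulkRequest bulkRequest) throws InterruptedException {
        semaphore.acquire(); // taken by the adding/flushing thread (which also holds the BulkProcessor lock, as in the dump below)
        client.bulk(bulkRequest, new ActionListener<BulkResponse>() {
            @Override
            public void onResponse(BulkResponse response) {
                semaphore.release(); // normal completion path
            }

            @Override
            public void onFailure(Exception e) {
                semaphore.release(); // failure path (where afterBulk/backoff handling would also happen)
            }
        });
        // If an OutOfMemoryError is thrown later on the netty I/O thread while
        // serialising or writing the request, neither callback above is invoked,
        // so the permit acquired here is never returned.
    }
}
```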
Here is a partial view of our thread dump:
"elasticsearch[...][bulk_processor][T#1]" - Thread t@63
java.lang.Thread.State: BLOCKED
at org.elasticsearch.action.bulk.BulkProcessor$Flush.run(BulkProcessor.java:342)
- waiting to lock <597aca4b> (a org.elasticsearch.action.bulk.BulkProcessor) owned by "Sync-7" t@71
"Sync-4" - Thread t@77
java.lang.Thread.State: BLOCKED
at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(BulkProcessor.java:286)
- waiting to lock <597aca4b> (a org.elasticsearch.action.bulk.BulkProcessor) owned by "Sync-7" t@71
"Sync-7" - Thread t@71
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <2d08aa0f> (a java.util.concurrent.Semaphore$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at java.util.concurrent.Semaphore.acquire(Semaphore.java:312)
at org.elasticsearch.action.bulk.BulkRequestHandler$AsyncBulkRequestHandler.execute(BulkRequestHandler.java:121)
at org.elasticsearch.action.bulk.BulkProcessor.execute(BulkProcessor.java:315)
at org.elasticsearch.action.bulk.BulkProcessor.executeIfNeeded(BulkProcessor.java:306)
at org.elasticsearch.action.bulk.BulkProcessor.internalAdd(BulkProcessor.java:288)
- locked <597aca4b> (a org.elasticsearch.action.bulk.BulkProcessor)
Here is the stack trace found using JMC (Java Mission Control):
Stack Trace Count
java.lang.Throwable.&lt;init&gt;(String) 1
java.lang.Error.&lt;init&gt;(String) 1
java.lang.VirtualMachineError.&lt;init&gt;(String) 1
java.lang.OutOfMemoryError.&lt;init&gt;(String) 1
java.nio.Bits.reserveMemory(long, int) 1
java.nio.DirectByteBuffer.&lt;init&gt;(int) 1
java.nio.ByteBuffer.allocateDirect(int) 1
io.netty.buffer.PoolArena$DirectArena.allocateDirect(int) 1
io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(int) 1
io.netty.buffer.PoolArena.allocateHuge(PooledByteBuf, int) 1
io.netty.buffer.PoolArena.allocate(PoolThreadCache, PooledByteBuf, int) 1
io.netty.buffer.PoolArena.allocate(PoolThreadCache, int, int) 1
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(int, int) 1
io.netty.buffer.AbstractByteBufAllocator.directBuffer(int, int) 1
io.netty.buffer.AbstractByteBufAllocator.directBuffer(int) 1
io.netty.channel.nio.AbstractNioChannel.newDirectBuffer(ByteBuf) 1
io.netty.channel.nio.AbstractNioByteChannel.filterOutboundMessage(Object) 1
io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise) 1
io.netty.channel.DefaultChannelPipeline$HeadContext.write(ChannelHandlerContext, Object, ChannelPromise) 1
io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(Object, ChannelPromise) 1
io.netty.channel.AbstractChannelHandlerContext.invokeWrite(Object, ChannelPromise) 1
io.netty.channel.AbstractChannelHandlerContext.write(Object, boolean, ChannelPromise) 1
io.netty.channel.AbstractChannelHandlerContext.write(Object, ChannelPromise) 1
org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.write(ChannelHandlerContext, Object, ChannelPromise) 1
io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(Object, ChannelPromise) 1
io.netty.channel.AbstractChannelHandlerContext.invokeWrite(Object, ChannelPromise) 1
io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext, Object, ChannelPromise) 1
io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext, Object, ChannelPromise) 1
io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext, Object, ChannelPromise) 1
io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run() 1
io.netty.util.concurrent.AbstractEventExecutor.safeExecute(Runnable) 1
io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(long) 1
io.netty.channel.nio.NioEventLoop.run() 1
io.netty.util.concurrent.SingleThreadEventExecutor$5.run() 1
java.lang.Thread.run() 1
More details
We index data using the bulk processor, with the following configuration (a minimal sketch of this setup in code follows the list):
- setBulkActions = 30000
- setBulkSize = 10 MB
- setFlushInterval = 10s
- setConcurrentRequests = 1
- setBackoffPolicy = 8 retries, 50 ms initial delay
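For reference, here is a minimal sketch of how a processor with this configuration might be built (the client variable, the empty listener body, and the exponential backoff flavour are assumptions, not our actual code):

```java
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue;

// client is any org.elasticsearch.client.Client instance (assumption for the sketch)
BulkProcessor bulkProcessor = BulkProcessor.builder(client, new BulkProcessor.Listener() {
        @Override
        public void beforeBulk(long executionId, BulkRequest request) { }

        @Override
        public void afterBulk(long executionId, BulkRequest request, BulkResponse response) { }

        @Override
        public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
            // our expectation: this should eventually fire once the backoff retries are exhausted
        }
    })
    .setBulkActions(30000)                                // flush after 30,000 actions...
    .setBulkSize(new ByteSizeValue(10, ByteSizeUnit.MB))  // ...or after 10 MB...
    .setFlushInterval(TimeValue.timeValueSeconds(10))     // ...or every 10 seconds
    .setConcurrentRequests(1)                             // at most one in-flight bulk request
    .setBackoffPolicy(BackoffPolicy.exponentialBackoff(   // 8 retries, 50 ms initial delay
        TimeValue.timeValueMillis(50), 8))
    .build();
```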
Questions
- From my understanding, some messages are never sent to the cluster (at this point, I don't know which ones):
  - Why is the onFailure method of the ActionListener never called? The backoff policy should have timed out and triggered it: is that right?
- Given these observations, are there other errors we should worry about (i.e. cases where the semaphore will never be released)? If so, are there known workarounds for them?
- I did not find much discussion about this flag recommendation besides this thread. Do you recommend not using this flag? If so, this could be an addition to the current documentation (or did I miss it?).
- Is there any configuration on the Elasticsearch side to avoid this issue?
Workaround (?)
One way to solve this issue is to remove the -XX:+DisableExplicitGC flag, thus allowing the System.gc() call made inside java.nio.Bits to take effect. By also adding -XX:+ExplicitGCInvokesConcurrent, a concurrent CMS cycle is triggered instead of a stop-the-world full GC.
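To make the reasoning concrete, here is a rough paraphrase of what java.nio.Bits.reserveMemory does in JDK 8 (a simplified sketch from memory, not the actual JDK source; tryReserve is a placeholder for the real direct-memory accounting):

```java
// When the direct-memory budget is exhausted, Bits.reserveMemory falls back to
// System.gc() to reclaim unreferenced DirectByteBuffers. With -XX:+DisableExplicitGC
// that call is a no-op, so the retries keep failing and an
// OutOfMemoryError("Direct buffer memory") is eventually thrown.
static void reserveMemorySketch(long size) {
    if (tryReserve(size)) {
        return;                      // enough direct memory is available
    }
    System.gc();                     // no-op when -XX:+DisableExplicitGC is set
    for (long sleep = 1; sleep <= 256; sleep <<= 1) {
        if (tryReserve(size)) {
            return;                  // memory freed by collecting dead DirectByteBuffers
        }
        try {
            Thread.sleep(sleep);     // give reference processing some time to catch up
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            break;
        }
    }
    throw new OutOfMemoryError("Direct buffer memory");
}

// Placeholder for the real reservation bookkeeping in java.nio.Bits.
static boolean tryReserve(long size) {
    return false;
}
```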
The workaround itself has not been tested yet; we plan to test it today and will keep you informed.