This seems to be related to queries, since it happened at the same time on multiple data nodes in the cluster.
From the memory dump, it seems that ~70% of the heap is taken up by objects that are waiting for GC (not reachable from a GC root), namely DisjunctionMaxQuery instances:
disjuncts of [Unreachable] org.apache.lucene.search.DisjunctionMaxQuery 24 24
Call stack of the thread that threw the OOM error:
<Thread 77> <--- OutOfMemoryError happened in this thread State: BLOCKED
java.lang.OutOfMemoryError.<init>() OutOfMemoryError.java:48
io.netty.util.internal.PlatformDependent.allocateUninitializedArray(int) PlatformDependent.java:200
io.netty.buffer.PoolArena$HeapArena.newByteArray(int) PoolArena.java:676
io.netty.buffer.PoolArena$HeapArena.newChunk(int, int, int, int) PoolArena.java:686
io.netty.buffer.PoolArena.allocateNormal(PooledByteBuf, int, int) PoolArena.java:244
io.netty.buffer.PoolArena.allocate(PoolThreadCache, PooledByteBuf, int) PoolArena.java:226
io.netty.buffer.PoolArena.reallocate(PooledByteBuf, int, boolean) PoolArena.java:397
io.netty.buffer.PooledByteBuf.capacity(int) PooledByteBuf.java:118
io.netty.buffer.AbstractByteBuf.ensureWritable0(int) AbstractByteBuf.java:285
io.netty.buffer.AbstractByteBuf.ensureWritable(int) AbstractByteBuf.java:265
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int, int) AbstractByteBuf.java:1077
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int) AbstractByteBuf.java:1070
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf) AbstractByteBuf.java:1060
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteBufAllocator, ByteBuf, ByteBuf) ByteToMessageDecoder.java:92
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ChannelHandlerContext, Object) ByteToMessageDecoder.java:263
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:340
io.netty.handler.logging.LoggingHandler.channelRead(ChannelHandlerContext, Object) LoggingHandler.java:241
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:340
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(ChannelHandlerContext, Object) DefaultChannelPipeline.java:1359
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.DefaultChannelPipeline.fireChannelRead(Object) DefaultChannelPipeline.java:935
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read() AbstractNioByteChannel.java:134
io.netty.channel.nio.NioEventLoop.processSelectedKey(SelectionKey, AbstractNioChannel) NioEventLoop.java:645
io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(Set) NioEventLoop.java:545
io.netty.channel.nio.NioEventLoop.processSelectedKeys() NioEventLoop.java:499
io.netty.channel.nio.NioEventLoop.run() NioEventLoop.java:459
io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
java.lang.Thread.run() Thread.java:748
We didn't change the default circuit breakers.
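To see what the default breakers are actually reporting on each node, the output of `GET _nodes/stats/breaker` can be summarized per node and per breaker. Below is a minimal sketch that assumes the response has already been fetched and parsed into a dict; the function name `summarize_breakers` and the sample values are hypothetical and only illustrate the shape of the response, they are not taken from this cluster.

```python
def summarize_breakers(stats):
    """Return (node_name, breaker, used_fraction, tripped) rows,
    sorted so the breakers closest to their limit come first."""
    rows = []
    for node in stats.get("nodes", {}).values():
        for breaker_name, breaker in node.get("breakers", {}).items():
            limit = breaker.get("limit_size_in_bytes", 0)
            used = breaker.get("estimated_size_in_bytes", 0)
            # Guard against a zero or missing limit.
            fraction = used / limit if limit > 0 else 0.0
            rows.append((node.get("name", "?"), breaker_name,
                         fraction, breaker.get("tripped", 0)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Made-up sample in the shape returned by GET _nodes/stats/breaker.
sample = {
    "nodes": {
        "node-id-1": {
            "name": "hot-data-1",
            "breakers": {
                "request": {"limit_size_in_bytes": 1000,
                            "estimated_size_in_bytes": 250, "tripped": 0},
                "parent": {"limit_size_in_bytes": 2000,
                           "estimated_size_in_bytes": 1500, "tripped": 3},
            },
        }
    }
}

for name, breaker, fraction, tripped in summarize_breakers(sample):
    print(f"{name:12s} {breaker:10s} {fraction:6.1%} tripped={tripped}")
```

A breaker with a non-zero `tripped` count, or one sitting close to its limit shortly before the OOM, would be a hint about which kind of allocation is growing. Breakers work on estimates, so memory they do not account for can still exhaust the heap.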
Cluster info:
Elasticsearch 6.2.4 cluster consisting of:
3 master nodes (4 cores / 14 GB memory / 7 GB heap)
6 client nodes (same setup)
38 hot data nodes (8 cores / 64 GB memory / 30.5 GB heap and 1.4 TB local SSD disks)
30 cold data nodes (8 cores / 32 GB memory / 16 GB heap and 3 TB spinning disks)
Cluster contains ~15,000 primary shards (27,000 total active shards) spread across ~8,900 indexes
Currently in use:
11 TB in hot storage (SSD)
45 TB in cold storage (HDD)
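As a rough back-of-the-envelope check on the numbers above, the average shard density per data node and per GB of heap can be worked out as follows. This is only a sketch: it assumes the ~27,000 active shards are spread evenly across all hot and cold data nodes, which is not necessarily the case, and the variable names are purely illustrative.

```python
# Figures taken from the cluster description above.
hot_nodes, cold_nodes = 38, 30
hot_heap_gb, cold_heap_gb = 30.5, 16
total_active_shards = 27_000

data_nodes = hot_nodes + cold_nodes
total_heap_gb = hot_nodes * hot_heap_gb + cold_nodes * cold_heap_gb

# Averages only; the real hot/cold distribution may be very uneven.
shards_per_node = total_active_shards / data_nodes
shards_per_gb_heap = total_active_shards / total_heap_gb

print(f"data nodes:          {data_nodes}")
print(f"total data heap:     {total_heap_gb:.0f} GB")
print(f"shards per node:     {shards_per_node:.0f}")
print(f"shards per GB heap:  {shards_per_gb_heap:.1f}")
```

Under that even-spread assumption this works out to roughly 397 shards per data node and about 16.5 shards per GB of data-node heap.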
Any help or guidance would be appreciated.