Elasticsearch data node crashing with OutOfMemoryError

This seems related to queries, as it happened at the same time on multiple data nodes in the cluster.
From the memory dump, it seems that ~70% of the heap is taken by objects that are waiting for GC (not reachable from the root), mostly DisjunctionMaxQuery. A representative entry from the dump analyzer:
disjuncts of [Unreachable] org.apache.lucene.search.DisjunctionMaxQuery 24 24

Call stack of the thread that threw the OOM error:
<Thread 77> <--- OutOfMemoryError happened in this thread State: BLOCKED
java.lang.OutOfMemoryError.<init>() OutOfMemoryError.java:48
io.netty.util.internal.PlatformDependent.allocateUninitializedArray(int) PlatformDependent.java:200
io.netty.buffer.PoolArena$HeapArena.newByteArray(int) PoolArena.java:676
io.netty.buffer.PoolArena$HeapArena.newChunk(int, int, int, int) PoolArena.java:686
io.netty.buffer.PoolArena.allocateNormal(PooledByteBuf, int, int) PoolArena.java:244
io.netty.buffer.PoolArena.allocate(PoolThreadCache, PooledByteBuf, int) PoolArena.java:226
io.netty.buffer.PoolArena.reallocate(PooledByteBuf, int, boolean) PoolArena.java:397
io.netty.buffer.PooledByteBuf.capacity(int) PooledByteBuf.java:118
io.netty.buffer.AbstractByteBuf.ensureWritable0(int) AbstractByteBuf.java:285
io.netty.buffer.AbstractByteBuf.ensureWritable(int) AbstractByteBuf.java:265
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int, int) AbstractByteBuf.java:1077
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf, int) AbstractByteBuf.java:1070
io.netty.buffer.AbstractByteBuf.writeBytes(ByteBuf) AbstractByteBuf.java:1060
io.netty.handler.codec.ByteToMessageDecoder$1.cumulate(ByteBufAllocator, ByteBuf, ByteBuf) ByteToMessageDecoder.java:92
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ChannelHandlerContext, Object) ByteToMessageDecoder.java:263
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:340
io.netty.handler.logging.LoggingHandler.channelRead(ChannelHandlerContext, Object) LoggingHandler.java:241
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(Object) AbstractChannelHandlerContext.java:340
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(ChannelHandlerContext, Object) DefaultChannelPipeline.java:1359
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(Object) AbstractChannelHandlerContext.java:362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext, Object) AbstractChannelHandlerContext.java:348
io.netty.channel.DefaultChannelPipeline.fireChannelRead(Object) DefaultChannelPipeline.java:935
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read() AbstractNioByteChannel.java:134
io.netty.channel.nio.NioEventLoop.processSelectedKey(SelectionKey, AbstractNioChannel) NioEventLoop.java:645
io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(Set) NioEventLoop.java:545
io.netty.channel.nio.NioEventLoop.processSelectedKeys() NioEventLoop.java:499
io.netty.channel.nio.NioEventLoop.run() NioEventLoop.java:459
io.netty.util.concurrent.SingleThreadEventExecutor$5.run() SingleThreadEventExecutor.java:858
java.lang.Thread.run() Thread.java:748

GC logs

We didn't change the default circuit breakers.

Cluster info:
The Elasticsearch 6.2.4 cluster consists of:

3 master nodes (4 cores / 14 GB memory / 7 GB heap)
6 client nodes (same setup)
38 hot data nodes (8 cores / 64 GB memory / 30.5 GB heap, 1.4 TB local SSD disks)
30 cold data nodes (8 cores / 32 GB memory / 16 GB heap, 3 TB spinning disks)
The cluster contains ~15,000 primary shards (27,000 total active shards) spread across ~8,900 indexes.
Currently in use:
11 TB in hot storage (SSD)
45 TB in cold storage (HDD)

Any help or guidance would be appreciated. 🙂


It seems like org.apache.lucene.search.DisjunctionMaxQuery objects are being created faster than the GC can clean them up. Is there any way to limit this?

For the OOM stacktrace, is this taken from the elasticsearch logfile directly? I'm wondering if there is more information that could be missing.

There is no way to limit the creation of this class. What is the search rate on your cluster? Do you know if there are a large number of DisMaxQueries or MultiMatchQueries?

Thanks for replying. I was able to figure out a few more things from the memory dump, including an abusive query that I believe is the root cause of the OOM.

The DisjunctionMaxQuery objects are referenced by org.apache.lucene.search.BooleanClause.
Because these objects are all related to queries, I started looking for query objects in the dump.
There was only one: org.elasticsearch.index.query.QueryStringQueryBuilder. When inspecting this object's queryString member, I was able to find the query that I believe caused the OOM.
The query had around 1000 AND clauses and many wildcards (*). Since we receive these queries from customers, we added a protection limit of 30 clauses on our end, which I hope will be enough to avoid the OOM.
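
For illustration, a minimal sketch of the kind of client-side guard we added; the 30-clause limit and the naive operator counting are our own choices, not anything built into Elasticsearch:

```java
// Hypothetical client-side guard applied before a customer query string is
// forwarded to Elasticsearch. The clause counting is deliberately naive
// (it ignores quoting and parentheses) and the limit of 30 is our own choice.
public final class QueryClauseGuard {

    private static final int MAX_CLAUSES = 30;

    public static void validate(String queryString) {
        // Rough clause count: number of boolean operators plus one.
        int clauses = queryString.split("\\s+(AND|OR)\\s+").length;
        if (clauses > MAX_CLAUSES) {
            throw new IllegalArgumentException(
                "Query has " + clauses + " clauses; the limit is " + MAX_CLAUSES);
        }
    }
}
```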

If too many clauses is indeed the issue, would it be a good idea to add a default limit (configurable, of course) on the number of clauses in a single Elasticsearch query? What do you think?

Regarding your questions:

For the OOM stack trace, is this taken from the elasticsearch logfile directly? I'm wondering if there is more information that could be missing.

The stack trace is from the memory dump. I didn't see anything in the logfiles (let me know if you want them).

What is the search rate on your cluster?

I don't think the rate is too high as I only found one query object in the dump.

Do you know if there are a large number of DisMaxQueries or MultiMatchQueries?

I didn't find any direct places where we use them. UPDATE: we use should inside filter, and it seems this adds a default minimum_should_match, which from what I understand uses DisMaxQueries.
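
For what it's worth, here is a rough Lucene-level sketch (assuming the Lucene 7 API bundled with Elasticsearch 6.2, not the exact code path Elasticsearch takes) of why these objects pile up: when a term is searched across several fields, the per-field queries are wrapped in a DisjunctionMaxQuery, so a query_string with ~1000 clauses over multiple fields builds on the order of a thousand of these.

```java
import java.util.Arrays;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Rough sketch only: when one term is searched across several fields,
// the per-field queries are wrapped in a DisjunctionMaxQuery so that the
// best-matching field determines the score. A query with ~1000 clauses
// over multiple fields allocates one of these per clause.
public class DisMaxSketch {
    public static void main(String[] args) {
        Query inTitle = new TermQuery(new Term("title", "error"));
        Query inBody  = new TermQuery(new Term("body", "error"));
        // tie-breaker 0.0f: only the highest-scoring field contributes
        DisjunctionMaxQuery perTerm =
            new DisjunctionMaxQuery(Arrays.asList(inTitle, inBody), 0.0f);
        System.out.println(perTerm);
    }
}
```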

Thanks!

I am glad you were able to track down a suspected/probable source of the problem.

There is a node-level setting, indices.query.bool.max_clause_count, which defaults to 1024 and caps the maximum number of clauses in a boolean query.
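
To the best of my knowledge this is a static setting, so raising it means adding a line such as indices.query.bool.max_clause_count: 2048 to elasticsearch.yml on each node and restarting it; the value 2048 here is only an illustrative example.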

@jaymode thanks for the new information regarding indices.query.bool.max_clause_count.
I saw that the giant query we had contained more than 1024 AND clauses.
Does indices.query.bool.max_clause_count truncate the query when it is above the limit, or does it refuse to run it at all (i.e., throw an error)?


The Lucene query should fail when we transform the Elasticsearch query builder (in this case QueryStringQueryBuilder) into an actual Lucene query. We could look into failing sooner, but I need to check to see how this would affect query re-writing that we do. In some cases we can reduce a full query to a match all/match none and significantly speed things up.
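
To illustrate the failure mode at the Lucene level (a rough sketch assuming the Lucene 7 API shipped with Elasticsearch 6.2, not the exact Elasticsearch code path): the query is not truncated; once the clause count passes the limit, building the boolean query throws BooleanQuery.TooManyClauses, which surfaces as an error for that request.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Rough Lucene-level illustration (not the Elasticsearch code path itself):
// once more clauses are added than BooleanQuery.getMaxClauseCount() allows
// (1024 by default), Lucene throws BooleanQuery.TooManyClauses instead of
// silently truncating the query.
public class TooManyClausesDemo {
    public static void main(String[] args) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        try {
            for (int i = 0; i <= BooleanQuery.getMaxClauseCount(); i++) {
                builder.add(new TermQuery(new Term("field", "term" + i)),
                            BooleanClause.Occur.MUST);
            }
        } catch (BooleanQuery.TooManyClauses e) {
            System.out.println("Rejected, not truncated: " + e);
        }
    }
}
```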
