IDs query tripping circuit breakers

Due to a previously discussed issue with aggregations, I have set my circuit breakers rather low to prevent Elasticsearch from exiting. However, among other issues, I am now seeing some strange behaviour on queries from a Java application that uses an IdsQueryBuilder to build a simple query (sketched after the stack trace below). It looks up a single ID on one index and type, and the resulting document is only a few KB in size, nothing special or large. Here is the error I'm seeing in the logging:
Caught Exception: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [15048704143/14gb], which is larger than the limit of [3749380096/3.4gb]] while executing. ESIndex: indexName/typeName ; query: {
"ids" : {
"type" : [
"typeName"
],
"values" : [
"idOfObject"
],
"boost" : 1.0
}
}
CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [15048704143/14gb], which is larger than the limit of [3749380096/3.4gb]]
at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:215)
at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128)
at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1465)
at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1360)
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:280)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:396)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:248)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:624)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:524)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:478)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:438)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.lang.Thread.run(Thread.java:748)
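
For reference, the Java side builds this roughly as follows (a simplified sketch; the real index, type, and ID values are placeholders here, and I'm assuming a plain 5.x transport Client):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.index.query.IdsQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

public class IdLookup {
    // "client" is assumed to be an already-connected 5.x Client
    // (for example a PreBuiltTransportClient).
    static SearchResponse lookupById(Client client) {
        // The same ids query shown in the log: one type, one id.
        IdsQueryBuilder query = QueryBuilders.idsQuery("typeName")
                .addIds("idOfObject");

        return client.prepareSearch("indexName")
                .setTypes("typeName")
                .setQuery(query)
                .get();
    }
}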

Hmm, that's strange; there should be something inside the [] indicating what is incrementing the breaker.

Can you attach (or link to) the output of /_nodes/stats for this cluster? I'd like to see the current values for all of the breakers.

Also, what version of ES are you running?

Version of ES: 5.4.0

Due to this breaker tripping on ID lookups, I had to revert my previous breaker settings (put in place because of https://github.com/elastic/elasticsearch/issues/24359).

The full node stats won't fit here, but I can show you what I had changed before this breaker started tripping:

indices.breaker.total.limit: 50% (lowered from the default of 70%)
indices.breaker.request.overhead: 550 (raised from the default of 1)

This is definitely going to cause a problem. By increasing the overhead to 550, you are essentially multiplying every memory estimation by 550 when the circuit breaker does its limit checks (so a 1,024-byte request checks whether it can add 563,200 bytes). I recommend that you reset this back to 1.
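
If it helps, here is a minimal sketch of putting the defaults back dynamically from the Java client (assuming the stock defaults of 70% and 1; these breaker settings are dynamic, so removing the overrides from elasticsearch.yml and restarting works just as well):

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;

public class ResetBreakerDefaults {
    // "client" is assumed to be an already-connected 5.x Client
    // (for example a PreBuiltTransportClient pointed at the cluster).
    static void resetBreakerDefaults(Client client) {
        Settings defaults = Settings.builder()
                .put("indices.breaker.total.limit", "70%")      // stock default
                .put("indices.breaker.request.overhead", "1.0") // stock default
                .build();

        // Persistent so the values survive a full cluster restart.
        client.admin().cluster().prepareUpdateSettings()
                .setPersistentSettings(defaults)
                .get();
    }
}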

Understood. I've reverted those changes.
In this thread, https://github.com/elastic/elasticsearch/issues/15892, one user's way of avoiding the thread exits caused by the linked issues was to set the breakers very conservatively.

Any suggestions on how to set the breakers to avoid the aggregation memory-allocation issues? This combination of issues is making it very difficult for me to sleep at night, given that right now any of our users could bring down individual nodes, or the whole cluster, with the wrong query or visualization.

Not necessarily by changing the breaker settings, but I believe you should be able to use the solution that was recommended here to alleviate this for now:

Also, just so you know (I realise it doesn't help you right now), there is an issue open to work on a fix for this, and a pull request fixing it was opened just yesterday; it's currently targeted for 5.4.2, 5.5.0, and 6.x.

The workaround works, but in my case it isn't a practical solution because my users create their own queries and visualizations (and would have to add the execution hint every time). Most of their queries and aggregations work just fine, until they do something that doesn't, and then it is too late. While some of this may be self-induced by the level of access our analysts have to our data and their level of training, the best protection I can put in place is one that doesn't rely on every user changing every aggregation query they run.
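
For anyone following along, the per-query hint in question looks roughly like this on the Java side (a sketch assuming it is the terms aggregation execution hint from the linked discussion; the aggregation, index, and field names are placeholders):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.TermsAggregationBuilder;

public class HintedAggregation {
    // "client" is assumed to be an already-connected 5.x Client.
    static SearchResponse aggregateWithMapHint(Client client) {
        // "map" builds buckets from field values directly instead of
        // global ordinals, trading some speed for a much smaller
        // up-front memory allocation.
        TermsAggregationBuilder byField = AggregationBuilders.terms("byField")
                .field("someField")
                .executionHint("map");

        return client.prepareSearch("indexName")
                .setSize(0)
                .addAggregation(byField)
                .get();
    }
}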

Additionally, given that 5.4.1 isn't even out yet, knowing that the fix is targeted for 5.4.2 doesn't make me optimistic that this will be resolved soon.
