ILM Forcemerge Internal Server Error

We have ILM enabled for our indices, and a number of the larger indices fail hours into the forcemerge action with errors similar to the one below:

{
  "type": "exception",
  "reason": "index [prod-jaeger-span-2022-05-11] in policy [custom-jaeger] encountered failures [{\"shard\":14,\"index\":\"prod-jaeger-span-2022-05-11\",\"status\":\"INTERNAL_SERVER_ERROR\",\"reason\":{\"type\":\"i_o_exception\",\"reason\":\"background merge hit exception: _16q(8.10.1):C21799956:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=10, timestamp=1652250163137}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8ecww _61y(8.10.1):c25320650:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=8, timestamp=1652308498719}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8ua32 _5fy(8.10.1):C25375232:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=10, timestamp=1652300718972}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8s7pc _1x4(8.10.1):C25592620:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=9, timestamp=1652258826332}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8glq0 _51d(8.10.1):C25187388:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, 
java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=10, timestamp=1652295573012}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8qxyb _2ip(8.10.1):C25006878:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=10, timestamp=1652265602770}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8ii2k _32e(8.10.1):C24976368:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=10, timestamp=1652271650446}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8k7bz _4ma(8.10.1):C25180054:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=6, timestamp=1652290231688}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8pnok _45v(8.10.1):C25141458:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=6, timestamp=1652284698326}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8o3ci _3nl(8.10.1):C25106026:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, 
java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=7, timestamp=1652278647274}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8m83w into _6bh [maxNumSegments=1] [ABORTED]\",\"caused_by\":{\"type\":\"i_o_exception\",\"reason\":\"Merge aborted.\",\"suppressed\":[{\"type\":\"i_o_exception\",\"reason\":\"Merge aborted.\"}]}}}] on step [forcemerge]",
  "stack_trace": "ElasticsearchException[index [prod-jaeger-span-2022-05-11] in policy [custom-jaeger] encountered failures [{\"shard\":14,\"index\":\"prod-jaeger-span-2022-05-11\",\"status\":\"INTERNAL_SERVER_ERROR\",\"reason\":{\"type\":\"i_o_exception\",\"reason\":\"background merge hit exception: _16q(8.10.1):C21799956:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=10, timestamp=1652250163137}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8ecww _61y(8.10.1):c25320650:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=8, timestamp=1652308498719}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8ua32 _5fy(8.10.1):C25375232:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=10, timestamp=1652300718972}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8s7pc _1x4(8.10.1):C25592620:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=9, timestamp=1652258826332}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8glq0 _51d(8.10.1):C25187388:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, 
source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=10, timestamp=1652295573012}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8qxyb _2ip(8.10.1):C25006878:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=10, timestamp=1652265602770}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8ii2k _32e(8.10.1):C24976368:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=10, timestamp=1652271650446}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8k7bz _4ma(8.10.1):C25180054:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=6, timestamp=1652290231688}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8pnok _45v(8.10.1):C25141458:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=6, timestamp=1652284698326}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8o3ci _3nl(8.10.1):C25106026:[diagnostics={os=Linux, java.version=11.0.15, os.arch=amd64, java.runtime.version=11.0.15+10, source=merge, os.version=5.4.170+, 
java.vendor=Eclipse Adoptium, java.vm.version=11.0.15+10, lucene.version=8.10.1, mergeMaxNumSegments=-1, mergeFactor=7, timestamp=1652278647274}]:[attributes={Lucene87StoredFieldsFormat.mode=BEST_COMPRESSION}] :id=6tcahwh20o60oz00ezdo8m83w into _6bh [maxNumSegments=1] [ABORTED]\",\"caused_by\":{\"type\":\"i_o_exception\",\"reason\":\"Merge aborted.\",\"suppressed\":[{\"type\":\"i_o_exception\",\"reason\":\"Merge aborted.\"}]}}}] on step [forcemerge]]\n\tat org.elasticsearch.xpack.core.ilm.ForceMergeStep.lambda$performAction$0(ForceMergeStep.java:80)\n\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:136)\n\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)\n\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88)\n\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:82)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onCompletion(TransportBroadcastByNodeAction.java:419)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onNodeResponse(TransportBroadcastByNodeAction.java:386)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:362)\n\tat org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:354)\n\tat org.elasticsearch.transport.TransportService$4.handleResponse(TransportService.java:847)\n\tat org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1471)\n\tat org.elasticsearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:340)\n\tat org.elasticsearch.transport.InboundHandler.handleResponse(InboundHandler.java:324)\n\tat 
org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:134)\n\tat org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:88)\n\tat org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:743)\n\tat org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:147)\n\tat org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:119)\n\tat org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:84)\n\tat org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:71)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)\n\tat 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583)\n\tat io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)\n\tat io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n"
}

It's not clear what's causing this. ILM retries the step and it eventually succeeds. This particular index is ~1.6TB split across 30 shards, but I've observed the same error on some smaller indices (~120GB, 20 shards).

ES Version: 7.16.2

The shards are running on warm nodes at this point. There are 7 warm nodes (15 CPU, 53Gi).

Thanks


Hi Michael,

As far as I understand, we abort segment merges when closing a shard (e.g. because it was unassigned or moved elsewhere), and normally you don’t see this.
The abort is apparently reported if it happens during a force-merge, as seen here.
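
To confirm that a given ILM failure really is one of these aborted merges, the per-shard failure list can be pulled out of the error reason. The following is only a rough sketch: the function names are made up for illustration, the sample string is a shortened stand-in for the error posted above, and the parsing assumes the reason always has the "encountered failures [...] on step [...]" shape seen in that error.

```python
import json


def parse_step_failures(reason):
    """Return the per-shard failure list embedded in an ILM step error reason.

    Assumes the shape 'index [...] encountered failures <json array> on step [...]',
    as in the error posted above. Returns [] when that shape is not found.
    """
    start_marker = "encountered failures "
    end_marker = " on step ["
    start = reason.find(start_marker)
    end = reason.rfind(end_marker)
    if start == -1 or end == -1 or end <= start:
        return []
    payload = reason[start + len(start_marker):end]
    try:
        failures = json.loads(payload)
    except json.JSONDecodeError:
        return []
    return failures if isinstance(failures, list) else []


def is_aborted_merge(failure):
    """True when the failure's cause is the 'Merge aborted.' I/O exception."""
    reason = failure.get("reason")
    if not isinstance(reason, dict):
        return False
    cause = reason.get("caused_by")
    return isinstance(cause, dict) and cause.get("reason") == "Merge aborted."


# Shortened stand-in for the real reason string (segment details omitted).
SAMPLE_REASON = (
    'index [prod-jaeger-span-2022-05-11] in policy [custom-jaeger] '
    'encountered failures [{"shard":14,"index":"prod-jaeger-span-2022-05-11",'
    '"status":"INTERNAL_SERVER_ERROR","reason":{"type":"i_o_exception",'
    '"reason":"background merge hit exception [ABORTED]",'
    '"caused_by":{"type":"i_o_exception","reason":"Merge aborted."}}}] '
    'on step [forcemerge]'
)

if __name__ == "__main__":
    for failure in parse_step_failures(SAMPLE_REASON):
        print(failure["shard"], is_aborted_merge(failure))
```

The shard numbers this returns are the ones worth looking up in the logs.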

You could further check the logs to understand if a shard move might have happened here as well.
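
Besides the logs, the cat recovery API (`GET _cat/recovery/<index>?format=json`) can hint at whether a shard copy was rebuilt from another node. Below is a minimal sketch of that check. The base URL, the absence of authentication, and the helper names are assumptions for illustration, and it relies on the `shard`, `type`, `source_node` and `target_node` columns of the cat recovery output, where a `peer` recovery covers both relocations and replica recoveries.

```python
import json
import urllib.request


def fetch_recoveries(base_url, index):
    """Fetch cat recovery rows for one index as a list of dicts.

    base_url is a hypothetical unauthenticated endpoint, e.g. http://localhost:9200.
    """
    url = f"{base_url}/_cat/recovery/{index}?format=json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def peer_recoveries_for_shard(rows, shard):
    """Keep the rows for the given shard whose recovery type is 'peer'.

    The cat API returns the shard number as a string, so both sides are
    compared as strings.
    """
    return [
        row
        for row in rows
        if str(row.get("shard")) == str(shard) and row.get("type") == "peer"
    ]


if __name__ == "__main__":
    # Hand-written example rows, not real cluster output.
    rows = [
        {"index": "prod-jaeger-span-2022-05-11", "shard": "14", "type": "peer",
         "stage": "done", "source_node": "warm-3", "target_node": "warm-5"},
        {"index": "prod-jaeger-span-2022-05-11", "shard": "2", "type": "existing_store",
         "stage": "done", "source_node": "n/a", "target_node": "warm-1"},
    ]
    for row in peer_recoveries_for_shard(rows, 14):
        print(row["source_node"], "->", row["target_node"])
```

A `peer` row for the failing shard does not prove that the move caused the abort, so it is best read alongside the timestamps in the node logs.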