Hi @DavidTurner, I'm currently using 7.17.0.
The problem I described was more of an example of an issue I saw in the past, so I don't have logs on hand for that exact scenario.
But there are also other issues that can cause this reallocation problem, e.g. a brief network fault between nodes, or a node crashing.
A more recent issue that I've seen cause searchable snapshots to reallocate is admittedly a bit strange*: the lead controller node started rejecting requests because of a memory circuit breaker, which eventually caused a cold node to drop out of the cluster and therefore the searchable snapshots on it to reallocate.
Here is an example of the event (I can provide more logs if needed):
Controller Log
failed to validate incoming join request from node [{es-prod-es-rack1-ml-0}{XnfbpS0lROeWDqgcIwQjXQ}{UalEL3mYSnacUfsWCczQng}{10.42.3.200}{10.42.3.200:9300}{lr}{k8s_node_name=k8s02-es, ml.machine_memory=32212254720, xpack.installed=true, zone=rack1, transform.node=false, ml.max_open_jobs=512, ml.max_jvm_size=2147483648}]
Controller Stacktrace
org.elasticsearch.transport.RemoteTransportException: [es-prod-es-rack1-ml-0][10.42.3.200:9300][internal:cluster/coordination/join/validate]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:cluster/coordination/join/validate] would be [2111119334/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [1935537744/1.8gb], new bytes reserved: [175581590/167.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=175581590/167.4mb, model_inference=0/0b, eql_sequence=0/0b, accounting=0/0b]
    at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:460) ~[elasticsearch-7.17.0.jar:7.17.0]
    at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:108) ~[elasticsearch-7.17.0.jar:7.17.0]
    at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:213) ~[elasticsearch-7.17.0.jar:7.17.0]
    at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:117) ~[elasticsearch-7.17.0.jar:7.17.0]
    at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:145) ~[elasticsearch-7.17.0.jar:7.17.0]
    at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:119) ~[elasticsearch-7.17.0.jar:7.17.0]
    at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:84) ~[elasticsearch-7.17.0.jar:7.17.0]
    at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:71) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1374) ~[?:?]
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1237) ~[?:?]
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1286) ~[?:?]
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507) ~[?:?]
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446) ~[?:?]
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583) ~[?:?]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
    at java.lang.Thread.run(Thread.java:833) [?:?]
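If I'm reading the numbers right, they line up with the real-memory parent breaker on the ML node's 2gb heap (ml.max_jvm_size=2147483648): assuming the default indices.breaker.total.limit of 95%, the limit comes out to 2147483648 × 0.95 ≈ 2040109465 bytes, which is exactly the [2040109465/1.8gb] limit in the exception, and the ~167mb join-validation payload on top of ~1.8gb of real usage is what tips it over.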
* I say weird because a machine learning node entered a state where it was continually failing to rejoin the cluster, effectively causing a denial of service against the lead controller. I eventually fixed it by restarting the ML node, after which it rejoined normally and the lead controller returned to a healthy state.
But given that there is a variety of possible problems that can cause the temporary loss of a node, I'd like something that acts more as a "band-aid" against the searchable snapshots reallocating while the root issue gets investigated/fixed.
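To illustrate the kind of band-aid I mean, here's a minimal sketch of the only knob I'm aware of today: raising index.unassigned.node_left.delayed_timeout so that a brief node outage doesn't immediately kick off reallocation. (I'm assuming here that delayed allocation applies to searchable snapshot indices the same way it does to regular indices, and the 10m value is just an example.)

```
PUT /_all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}
```

That only postpones the reallocation for the length of the timeout, though, which is why I'd still like a more targeted option for searchable snapshots.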