Mounted Searchable Snapshot Delay Reallocation/Recovery

Hi All,

I'm using searchable snapshots on a cluster and was wondering if there is a way to delay the reallocation of a mounted searchable snapshot for a period of time.

Example problem: if I have a cold node which gets hit by a heavy query, the CPU can spike to 100% and "lock" the node for more than a minute. This is long enough for the Elasticsearch cluster to decide the node is "dead" and begin the recovery/reallocation process for that node's mounted searchable snapshots. This is bad for me, as a single node can hold 10TB of data that gets needlessly reshuffled.

I'd like to be able to tell Elasticsearch that if a node becomes unavailable, it should wait for X minutes before trying to reallocate/recover searchable snapshot indices.

I found Delaying allocation when a node leaves, and am thinking of doing something like:

PUT /restored-*/_settings?timeout=60s
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}

To apply to all searchable snapshot indices.
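
For completeness, I'd verify that the setting actually lands on every mounted index with something like the following (assuming the restored-* pattern covers all of my mounted searchable snapshot indices):

GET /restored-*/_settings/index.unassigned.node_left.delayed_timeout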

But what I can't find is whether this would actually work, as there really isn't any mention of how searchable snapshots interact with this setting.

Would anyone know if this would work, or if there is another way to achieve this?

Seems to me that it would be better to address this fundamental problem instead of just improving how Elasticsearch reacts to it. It should definitely be able to distinguish a busy node from a dead one. What version are you using? What do you see in the logs at this time?

Hi @DavidTurner I'm currently using 7.17.0.

The problem I described was more of an example of an issue I've seen in the past, so I don't have direct logs for that type of scenario.

But there could also be other issues which cause this reallocation problem, e.g. a minor network fault affecting a node, or a node crashing.

A more recent issue that caused searchable snapshots to reallocate is actually kind of weird*: the leading controller node started to reject requests because of a memory circuit breaker, which eventually caused a cold node to drop and therefore the searchable snapshots on it to reallocate.

Here is an example of the event (I can provide more logs if wanted):

Controller Log

failed to validate incoming join request from node [{es-prod-es-rack1-ml-0}{XnfbpS0lROeWDqgcIwQjXQ}{UalEL3mYSnacUfsWCczQng}{10.42.3.200}{10.42.3.200:9300}{lr}{k8s_node_name=k8s02-es, ml.machine_memory=32212254720, xpack.installed=true, zone=rack1, transform.node=false, ml.max_open_jobs=512, ml.max_jvm_size=2147483648}]

Controller Stacktrace

org.elasticsearch.transport.RemoteTransportException: [es-prod-es-rack1-ml-0][10.42.3.200:9300][internal:cluster/coordination/join/validate]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:cluster/coordination/join/validate] would be [2111119334/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [1935537744/1.8gb], new bytes reserved: [175581590/167.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=175581590/167.4mb, model_inference=0/0b, eql_sequence=0/0b, accounting=0/0b]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:460) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:108) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:213) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:117) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:145) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:119) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:84) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:71) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1374) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1237) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1286) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

* Weird in that a Machine Learning node entered some state where it was continually failing to rejoin the cluster, effectively causing a denial of service against the lead controller. I eventually fixed the issue by restarting the ML node; once it came back up it rejoined fine and the lead controller returned to a healthy state.

But given that there is a variety of possible problems that can cause a temporary loss of a node, I'd like something that is more of a "band-aid" for the searchable snapshot reallocation while the root issue gets investigated/fixed.

The ML node was out of memory, or at least near enough that it couldn't receive the incoming cluster state it needed in order to join the cluster. That sounds like a potential bug too; we'd probably need a heap dump to see why it wasn't freeing things up. By "lead controller" do you mean the elected master? If so, that shouldn't have any bad effects on the master, so what do you mean by "denial of service"?

There shouldn't be any reason for a healthy node to leave the cluster; please report bugs like this if you find them.

Hi @DavidTurner,

By "lead controller" do you mean the elected master?

Yes, referring to elected master here.

what do you mean "denial of service"?

Part 1:

Looking at the logs of the cluster, in the ~35 minutes that the ML node was in the bad state it appears to have tried to rejoin the cluster ~600 times per minute (~21k requests total), hitting each port in the range 9300-9305, so ~100 attempts per minute per port.

Which (and I'm assuming here) is probably an unhealthy number of join attempts against an elected master. (If I'm wrong here, please correct me.)

Part 2:

Shortly after the ML node went out, the cluster started to transfer the ML jobs to other nodes, and shortly after that began I started to see logs like:

address [10.42.0.114:9300], node [{es-prod-es-rack1-controller-0}{P1QLd6HBS4qh2fP34PwjVg}{hAJDF5x3QUeuZpbjOqoP1g}{10.42.0.114}{10.42.0.114:9300}{m}{k8s_node_name=k8s01-es, xpack.installed=true, zone=rack1, transform.node=false}], requesting [false] peers request failed
org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-prod-es-rack1-controller-0][10.42.0.114:9300][internal:discovery/request_peers] request_id [5434414] timed out after [3002ms]
org.elasticsearch.discovery.MasterNotDiscoveredException: null
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) [elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.0.jar:7.17.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

(Assuming here again) but it appears that transferring the ML jobs from a crashed node that isn't recovering to other nodes creates some sort of denial-of-service/thundering-herd problem.

A note: These issues/logs only affected the cold nodes, transform nodes, and ml nodes. The content/hot nodes and warm nodes of the cluster did not have these timeout issues.

It's a bit wasteful but I wouldn't expect it to have any bad effects beyond the noise in the logs.

Certainly some kind of overload situation, maybe on the masters or maybe on the ML nodes (or maybe the network between them). Hard to say for sure from these tiny fragments of logs, although even full logs might not be very helpful; typically you need to grab stack dumps or profiles while things are going wrong.
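
For example, something like the hot threads API, captured a few times on the affected nodes while the problem is ongoing (or a jstack/JFR recording of the JVM), usually shows where the time is going:

GET /_nodes/hot_threads?threads=9999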

I opened a bug report about the join loop.

Yeah, unfortunately I didn't think of taking a heap dump of the offending ML node before rebooting it, but I will make sure to note that for future issues.

Regarding logs, I have all logs from the cluster from INFO -> FATAL + the default audit logs, so if you think anything could be useful, I should definitely be able to provide it.

Looking at the resource usage graphs, it looks like right before the issue the CPU on Rack 1 Controller 0 (the original elected master) spiked, before metrics stopped being collectable:

After the metrics started working again, Rack 5 Controller 0 appears to have taken over as the elected master, with higher CPU usage than before:

So I suspect that Rack 1 Controller 0 got overloaded and stopped responding, causing a new election that Rack 5 Controller 0 won.
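
For reference, something like the following shows which node is currently the elected master, so I'll keep an eye on it going forward:

GET /_cat/master?v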

Sure, I'll take a look at the logs in case anything obvious jumps out. Would you share them on https://gist.github.com?

Hi @DavidTurner, gist doesn't like the size of the logs (~36k events, ~250 MB). Is there anywhere else I could upload the logs or a zip of them? (Excluding audit events, which are literally 1 million events in the issue timeframe.)

Note: I can also open a ticket with support if you think that is a better route.

Yes, at that scale a support ticket might be best.

Will do, thanks for the help.

Btw, following back up on the original post: does that setting work with searchable snapshots, or is there currently no way to delay reallocation for them?

It only affects replicas, but typically you don't have replicas for searchable snapshots. If you did, it would work on them.
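
If you want to confirm whether your mounted indices have any replicas at all, something like this would show it (using the same restored-* pattern from your example):

GET /restored-*/_settings/index.number_of_replicas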

Ok, thanks, would it make sense to open a feature request for something like this?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.