Mounted Searchable Snapshot Delay Reallocation/Recovery

Hi All,

I'm using searchable snapshots on a cluster and was wondering if there is a way to delay the reallocation of a mounted searchable snapshot for a period of time.

Example problem: if I have a cold node which gets hit by a heavy query, the CPU can spike to 100% and "lock" the node for more than a minute. This is long enough for the Elasticsearch cluster to decide the node is "dead" and begin the recovery/reallocation process for that node's mounted searchable snapshots. This is bad for me, as a single node can hold 10TB of data that gets needlessly reshuffled.

I'd like to be able to tell Elasticsearch that if a node becomes unavailable, it should wait for X minutes before trying to reallocate/recover searchable snapshot indices.

I found Delaying allocation when a node leaves, and am thinking of doing something like:

PUT /restored-*/_settings?timeout=60s
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}

To apply to all searchable snapshot indices.
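
For completeness, I'd verify that the setting actually lands on every mounted index with something like the following (assuming the restored-* pattern covers all of my mounted searchable snapshot indices):

GET /restored-*/_settings/index.unassigned.node_left.delayed_timeout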

But what I can't find is whether this would actually work, as there really isn't any mention of how searchable snapshots interact with this setting.

Would anyone know if this would work, or if there is another way to achieve this?

Seems to me that it would be better to address this fundamental problem instead of just improving how Elasticsearch reacts to it. It should definitely be able to distinguish a busy node from a dead one. What version are you using? What do you see in the logs at this time?

Hi @DavidTurner I'm currently using 7.17.0.

The problem I described was more of an example of an issue I've seen in the past, so I don't have direct logs for that type of scenario.

But there could also be other issues which cause this reallocation problem, e.g. a minor network fault affecting a node, or a node crashing.

A more recent issue that caused searchable snapshots to reallocate is actually kind of weird*: the leading controller node started to reject requests because of a memory circuit breaker, which eventually caused a cold node to drop and therefore the searchable snapshots on it to reallocate.

Here is an example of the event (I can provide more logs if wanted):

Controller Log

failed to validate incoming join request from node [{es-prod-es-rack1-ml-0}{XnfbpS0lROeWDqgcIwQjXQ}{UalEL3mYSnacUfsWCczQng}{10.42.3.200}{10.42.3.200:9300}{lr}{k8s_node_name=k8s02-es, ml.machine_memory=32212254720, xpack.installed=true, zone=rack1, transform.node=false, ml.max_open_jobs=512, ml.max_jvm_size=2147483648}]

Controller Stacktrace

org.elasticsearch.transport.RemoteTransportException: [es-prod-es-rack1-ml-0][10.42.3.200:9300][internal:cluster/coordination/join/validate]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:cluster/coordination/join/validate] would be [2111119334/1.9gb], which is larger than the limit of [2040109465/1.8gb], real usage: [1935537744/1.8gb], new bytes reserved: [175581590/167.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=175581590/167.4mb, model_inference=0/0b, eql_sequence=0/0b, accounting=0/0b]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:460) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:108) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:213) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:117) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:145) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:119) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:84) ~[elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:71) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1374) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1237) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1286) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

* Weird in that a Machine Learning node entered some state where it was continually failing to rejoin the cluster, effectively causing a denial of service against the lead controller. I eventually fixed the issue by restarting the ML node; once it came back up it rejoined fine and the lead controller returned to a healthy state.

But given that there is a variety of possible problems that can cause a temporary loss of a node, I'd like something that is more of a "band-aid" for the searchable snapshot reallocation while the root issue gets investigated/fixed.

The ML node was out of memory, or at least near enough that it couldn't receive the incoming cluster state it needed in order to join the cluster. That sounds like a potential bug too; we'd probably need a heap dump to see why it wasn't freeing things up. By "lead controller" do you mean the elected master? If so, that shouldn't have any bad effects on the master, so what do you mean by "denial of service"?

There shouldn't be any reason for a healthy node to leave the cluster; please report bugs like this if you find them.

Hi @DavidTurner,

By "lead controller" do you mean the elected master?

Yes, referring to elected master here.

what do you mean "denial of service"?

Part 1:

Looking at the logs of the cluster, in the ~35 minutes that the ML node was in the bad state it appears to have tried to rejoin the cluster ~600 times per minute (~21k requests total), hitting each port in the range 9300-9305, so ~100 attempts per minute per port.

Which (and I'm assuming here) is probably an unhealthy number of join attempts against an elected master. (If I'm wrong here, please correct me.)

Part 2:

Shortly after the ML node went out, the cluster started to transfer the ML jobs to other nodes, and shortly after that began I started to see logs like:

address [10.42.0.114:9300], node [{es-prod-es-rack1-controller-0}{P1QLd6HBS4qh2fP34PwjVg}{hAJDF5x3QUeuZpbjOqoP1g}{10.42.0.114}{10.42.0.114:9300}{m}{k8s_node_name=k8s01-es, xpack.installed=true, zone=rack1, transform.node=false}], requesting [false] peers request failed
org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-prod-es-rack1-controller-0][10.42.0.114:9300][internal:discovery/request_peers] request_id [5434414] timed out after [3002ms]
org.elasticsearch.discovery.MasterNotDiscoveredException: null
	at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onTimeout(TransportMasterNodeAction.java:297) [elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onTimeout(ClusterStateObserver.java:345) [elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onTimeout(ClusterStateObserver.java:263) [elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.cluster.service.ClusterApplierService$NotifyTimeout.run(ClusterApplierService.java:660) [elasticsearch-7.17.0.jar:7.17.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:718) [elasticsearch-7.17.0.jar:7.17.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

(Assuming here again) but it appears that transferring the ML jobs from a crashed node that isn't recovering to other nodes creates some sort of denial-of-service/thundering-herd problem.

A note: These issues/logs only affected the cold nodes, transform nodes, and ml nodes. The content/hot nodes and warm nodes of the cluster did not have these timeout issues.

It's a bit wasteful but I wouldn't expect it to have any bad effects beyond the noise in the logs.

Certainly some kind of overload situation, maybe on the masters or maybe on the ML nodes (or maybe the network between them). Hard to say for sure from these tiny fragments of logs, although even full logs might not be very helpful; typically you need to grab stack dumps or profiles while things are going wrong.
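
For example, something like the hot threads API, captured a few times on the affected nodes while the problem is ongoing (or a jstack/JFR recording of the JVM), usually shows where the time is going:

GET /_nodes/hot_threads?threads=9999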

I opened a bug report about the join loop.

Yeah, unfortunately I didn't think of taking a heap dump of the offending ML node before rebooting it, but I will make sure to note that for future issues.

Regarding logs, I have all logs from the cluster from INFO -> FATAL + the default audit logs, so if you think anything could be useful, I should definitely be able to provide it.

Looking at the resource usage graphs, it looks like right before the issue the CPU on Rack 1 Controller 0 (the original elected master) spiked, before metrics stopped being collectable:

After the metrics started working again, Rack 5 Controller 0 appears to have taken over as the elected master, with higher CPU usage than before:

So I suspect that Rack 1 Controller 0 got overloaded and stopped responding, causing a new election that Rack 5 Controller 0 won.
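
For reference, something like the following shows which node is currently the elected master, so I'll keep an eye on it going forward:

GET /_cat/master?v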

Sure, I'll take a look at the logs in case anything obvious jumps out. Would you share them on https://gist.github.com?

Hi @DavidTurner, gist doesn't like the size of the logs (~36k events, ~250 MB). Is there anywhere else I could upload the logs or a zip of them? (Excluding audit events, which are literally 1 million events in the issue timeframe.)

Note: I can also open a ticket with support if you think that is a better route.

Yes, at that scale a support ticket might be best.

Will do, thanks for the help.

Btw, following back up on the original post: does that setting work with searchable snapshots, or is there currently no way to delay reallocation for them?

It only affects replicas, but typically you don't have replicas for searchable snapshots. If you did, it would work on them.
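
If you want to confirm whether your mounted indices have any replicas at all, something like this would show it (using the same restored-* pattern from your example):

GET /restored-*/_settings/index.number_of_replicas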

Ok, thanks, would it make sense to open a feature request for something like this?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.