Shard has exceeded the maximum number of retries

After moving one of our clusters from ES 6.4 to 7.5, we have been seeing frequent instances of shards failing to allocate because they hit the max of 5 retries.

"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-01-19T04:02:46.344Z], failed_attempts[5]...

The reason given for the failure generally ends up being circuit breaker related (since the change to include real memory usage in the parent circuit breaker, we have been making adjustments, but circuits are still tripped occasionally).

The circuit breakers are a separate issue; my question for now is: is there something I can do to remove the need for manual intervention (calling /_cluster/reroute?retry_failed=true) here?

I appreciate that the cluster is trying to protect itself to maintain stability, which is why I have been trying to avoid turning off the new parent circuit breaker logic; however, there's not much the cluster could do that would be worse than just silently, permanently giving up on recovering a replica. That sets us up to lose data if someone is not closely paying attention and is there to force the retry.

Increasing the retry count does not seem like a real fix. I do not know how quickly the retries happen, but whenever I see this issue it generally is affecting several shards at once. I would much prefer it if I could tell the cluster to wait a while and retry again later. Or something along those lines.

Edit: I should note that calling "/_cluster/reroute?retry_failed=true" always seems to do the trick on the first try.
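For anyone who wants to script that manual step, here is a minimal Python sketch of the call. The endpoint and the retry_failed parameter come from the error message above; the function names and the base URL are made up for illustration, and authentication/TLS are left out.

```python
import json
import urllib.request


def build_retry_failed_request(base_url: str) -> urllib.request.Request:
    """Build (without sending) POST /_cluster/reroute?retry_failed=true."""
    url = base_url.rstrip("/") + "/_cluster/reroute?retry_failed=true"
    return urllib.request.Request(url, method="POST")


def retry_failed_allocations(base_url: str, timeout: float = 30.0) -> dict:
    """Send the request and return the parsed JSON response.

    base_url is an assumption, e.g. "http://localhost:9200".
    """
    request = build_retry_failed_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling retry_failed_allocations("http://localhost:9200") once should be equivalent to issuing the reroute by hand.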

The same thing happened during our cluster's rolling upgrade from 6.8 to 7.4.

Many different things can cause it, and it's hard to say which one applies to you. Check each of these carefully on your cluster and pick the matching solution to make it green again:

1. Too many open files (our data nodes set the file handle limit to over 1,000,000 and do not set it through ulimit; the 65535 given in the official ES reference is not enough in some cases).

Solution: Set a higher value.

2. Nodes of different major versions (6.x and 7.x) in the same cluster, which means some shards cannot be relocated to some nodes.

Solution: Just wait until the rolling upgrade has finished on all nodes.

3. A failed translog or data version check, which leaves the primary and replica shards inconsistent.

Solution: Set the index setting index.number_of_replicas to "0", wait for the cluster to become green again, then set it back to your original value (may result in data loss).

Hope this helps!

Thanks for the suggestions!

I am actually able to make it green again; my worry is more that I keep seeing this happen randomly, and each time it requires some manual action to recover those shards.

When I run "GET _cluster/allocation/explain?pretty", the response tells me that the max_retry being reached was the sole reason for some of the nodes in the cluster not being allocated to. Those nodes are on the correct version of ES, and I believe a circuit breaker is to blame for the failure.
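As a rough illustration of that check, the sketch below takes an already-parsed allocation explain response and lists the nodes whose only NO decider is max_retry. It assumes the usual response layout (a node_allocation_decisions list whose entries carry node_id and deciders); the function name is hypothetical.

```python
from typing import Any, Dict, List


def nodes_blocked_only_by(explain: Dict[str, Any],
                          decider_name: str = "max_retry") -> List[str]:
    """Return node ids whose NO decisions come solely from one decider."""
    blocked = []
    for node in explain.get("node_allocation_decisions", []):
        # Collect the names of every decider that voted NO for this node.
        no_deciders = {
            d.get("decider")
            for d in node.get("deciders", [])
            if d.get("decision") == "NO"
        }
        if no_deciders == {decider_name}:
            blocked.append(node.get("node_id", ""))
    return blocked
```

If every candidate node shows up in that list, a retry_failed reroute is the only thing standing between the shard and allocation.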

Your first suggestion about changing the ulimit is something I am unsure of. Our cluster is running on Windows, and the ES documentation on ulimit specifically calls out Linux. I do not know if there is a Windows equivalent.

I am starting to think that it may make more sense for me to target the core issue of circuit breakers being tripped frequently. Ever since moving to 7.5 we have had issues with the parent circuit breaker (which I figured was just due to it considering real memory usage). But at this point I have given all of the nodes much more memory (all the way up to 62GB despite general advice to stay below 32GB), which helped... but I still see breakers being tripped occasionally.

      "node_id" : "xIrg7g1TRtiOLmAhJzKuoA",
      "node_name" : "",
      "transport_address" : "",
      "node_attributes" : {
        "scale_unit" : "31"
      "node_decision" : "no",
      "deciders" : [
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-01-23T00:44:33.772Z], failed_attempts[5], failed_nodes[[vyrHfX2bTdOn-nVHPsm8Vw, HCGRibcGSS2sd34_LjQ9Ag]], delayed=false, details[failed shard on node [HCGRibcGSS2sd34_LjQ9Ag]: failed recovery, failure RecoveryFailedException[[indexname][29]: Recovery failed from {node1}{_DxQwa6QTw2WRF6qzRqBrQ}{4MNvXtDkQyef1xO3U0D50A}{node1}{}{d}{scale_unit=15} into {node2}{HCGRibcGSS2sd34_LjQ9Ag}{Nc4bONwTSwCv9YH1uMgUlw}{node2}{}{d}{scale_unit=17}]; nested: RemoteTransportException[[node1][][internal:index/shard/recovery/start_recovery]]; nested: RecoveryEngineException[Phase[1] prepare target for translog failed]; nested: RemoteTransportException[[node2][][internal:index/shard/recovery/prepare_translog]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [63725814370/59.3gb], which is larger than the limit of [63243393433/58.8gb], real usage: [63725814080/59.3gb], new bytes reserved: [290/290b], usages [request=0/0b, fielddata=689861/673.6kb, in_flight_requests=290/290b, accounting=630902452/601.6mb]]; ], allocation_status[no_attempt]]]"

In a separate thread, about a different 7.5 issue, it was determined that some ES calculations were being performed incorrectly due to the presence of many nested fields. I am wondering if that could also explain the greatly increased memory pressure we have been seeing since the upgrade.
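To make the numbers in the CircuitBreakingException quoted above easier to compare, here is a small Python sketch that pulls the byte counts out of such a message. The regular expressions are written against the message format shown in this thread only (an assumption; other versions may word it differently), and the function name is hypothetical.

```python
import re
from typing import Dict

_PARENT_RE = re.compile(
    r"would be \[(?P<would_be>\d+)/[^\]]*\], "
    r"which is larger than the limit of \[(?P<limit>\d+)/[^\]]*\], "
    r"real usage: \[(?P<real_usage>\d+)/[^\]]*\], "
    r"new bytes reserved: \[(?P<reserved>\d+)/[^\]]*\]"
)
_USAGES_RE = re.compile(r"usages \[(?P<body>[^\]]*)\]")
_CHILD_RE = re.compile(r"(\w+)=(\d+)/")


def parse_parent_breaker(message: str) -> Dict[str, int]:
    """Extract byte counts from a '[parent] Data too large' message."""
    match = _PARENT_RE.search(message)
    if match is None:
        raise ValueError("not a parent circuit breaker message")
    result = {name: int(value) for name, value in match.groupdict().items()}
    # How far the request would have pushed the heap past the limit.
    result["overshoot"] = result["would_be"] - result["limit"]
    # Sum of what the child breakers (request, fielddata, ...) account for.
    usages = _USAGES_RE.search(message)
    children = _CHILD_RE.findall(usages.group("body")) if usages else []
    result["tracked"] = sum(int(value) for _, value in children)
    return result
```

For the message above, the child breakers account for roughly 600 MB while real usage is 59.3 GB, so almost all of the heap being counted is memory the breakers do not track individually.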

Ahh, it seems the actual cause is the unpredictable rise in memory usage after the upgrade!

That reminds me of some discussions about JVM settings that cause the breaker to trip:

  1. parent-circuit-breaker-calculation-seems-to-be-wrong-with-version-7-x
  2. circuitbreakingexception-parent-data-too-large-in-es-7-x
  3. g1gc-cause-circuitbreakingexception-parent-data-too-large-on-7-1-1

The same thing made breakers trip occasionally in our cluster after we upgraded to 7.4.

We then changed the default JVM settings to the ones below for our TSDB use case, and the breakers (which for us were caused by indices.breaker.request) never tripped again!


# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space


## Expert settings
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing

## GC configuration
# -XX:+UseConcMarkSweepGC
# -XX:CMSInitiatingOccupancyFraction=75
# -XX:+UseCMSInitiatingOccupancyOnly

## G1GC Configuration
# NOTE: G1GC is only supported on JDK version 10 or later.
# To use G1GC uncomment the lines below.
## optimizations
# main reason; don't set it higher than 40
10-:-XX:InitiatingHeapOccupancyPercent=30
# let GC happen more frequently
10-:-XX:MaxGCPauseMillis=5
# let GC happen more frequently
10-:-XX:GCPauseIntervalMillis=10

## optimizations

Hope it fixes your problem!

The heap on this node looks to be maxed out, and that is indeed the thing that needs addressing:

It's possible that your heap usage is a lot higher than it was in 6.x, but it's also possible that you were maxing out the heap there too since in 6.x we didn't circuit-break on the total heap usage.

It would be useful to see your current GC config. Can you correlate these periods of high heap usage with anything else (e.g. someone performing a heavy search on these nodes)? Are these allocation failures happening when the index is created, or are they recovering replicas after a failure? IIRC you are doing time-based indexing; do you have separate "hot" nodes that focus on the indexing and "warm" nodes to hold past indices, and are you moving data to the warm tier fast enough?

Have you considered running a larger number of smaller nodes instead? Larger heaps are correspondingly more expensive to GC; it's possible that more+smaller nodes would keep up with your (substantial) ingest load better.

To answer your original question, no, today if you hit the maximum number of retries for a shard then manual intervention is required. There's an improvement in the works to retry indefinitely with backoff, but we don't have a target release for that yet.
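Until something like that ships, one possible workaround is an external watchdog that does the retry for you with backoff. The following is only a sketch of the idea, not a built-in Elasticsearch feature: has_stuck_shards and trigger_retry are hypothetical callables you would supply (for example, one that checks allocation explain output and one that calls /_cluster/reroute?retry_failed=true), and the delay values are arbitrary.

```python
import time
from typing import Callable, Iterator, Optional


def backoff_delays(initial: float = 60.0, factor: float = 2.0,
                   maximum: float = 3600.0) -> Iterator[float]:
    """Yield exponentially growing delays in seconds, capped at maximum."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def retry_until_assigned(has_stuck_shards: Callable[[], bool],
                         trigger_retry: Callable[[], None],
                         sleep: Callable[[float], None] = time.sleep,
                         max_rounds: Optional[int] = None) -> int:
    """Trigger retries with backoff until nothing is stuck.

    Returns the number of retries that were triggered.
    """
    rounds = 0
    for delay in backoff_delays():
        if not has_stuck_shards():
            break
        if max_rounds is not None and rounds >= max_rounds:
            break
        trigger_retry()
        rounds += 1
        sleep(delay)  # wait before checking again
    return rounds
```

Backing off rather than retrying immediately gives a node whose heap was over the breaker limit some time to finish a GC cycle before the recovery is attempted again.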

Back in 6.x, we maxed out the jvm heap at 30 GB as recommended. It is certainly possible that it was getting close to max frequently back then... but now in 7.5 we have increased our jvm heap to 62 GB, along with some other optimizations to try and decrease pressure on the nodes. After that, it is surprising to me that we are still tripping breakers.

Not really. The pattern I have gotten used to seeing is leaving late at night to a green cluster, and then coming in early in the morning to see 8 or so of these failed shards keeping the cluster yellow. Ingestion continues (though due to traffic patterns, it is at a low point), while the querying rate would be essentially 0.

They are recovering replicas after a failure.

You are correct about our indexing being time-based. Over the past few years we have not had a need to get to the level of sophistication where we have hot and warm nodes. Although I do expect it would improve performance, I do not think it is something I can do in the short-term.

In our current production environment, our SKU selection is somewhat limited. Back on 4.x/5.x/6.x, we stayed at 30 GB jvm heap despite having 128 GB memory available. Which meant that only our Data nodes (we have dedicated nodes for data, master, querying, and writing) made good use of the available memory. When I saw the parent circuit breakers being tripped in 7.5, I decided to start experimenting. Rather than toggle off the new feature where real memory usage is considered, I wanted to try to get them to stop breaking by relieving that memory pressure. Which is why I went past that 30 GB boundary.

Also, I'll say that currently it is only data nodes whose breakers are tripping, and I do not really understand why we would need more data nodes than we already have. Our shard sizes are under 10 GB, each node has ~100 shards, and I have a rule that says no node can have more than 1 shard for the index that is currently being written to. If there were gains to be gotten from adding more data nodes, I imagine we would need to also add more shards, which seems counter-productive.

Last night I set the retry limit up to 10, but as I expected, it appears to have been hit just as much as the limit of 5 was.
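For reference, raising the limit can be scripted along these lines. This is a sketch that only builds the request: the setting name index.allocation.max_retries is the index-level setting behind the [5] in the error message, while the function name, base URL, and index name are placeholders.

```python
import json
import urllib.request


def build_max_retries_request(base_url: str, index: str,
                              max_retries: int) -> urllib.request.Request:
    """Build PUT /<index>/_settings raising index.allocation.max_retries."""
    body = json.dumps({"index.allocation.max_retries": max_retries})
    url = f"{base_url.rstrip('/')}/{index}/_settings"
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical usage; send it with urllib.request.urlopen(request).
request = build_max_retries_request("http://localhost:9200", "my-index", 10)
```

As noted above, a higher count on its own does not help much if all the attempts happen while the target node is still over its breaker limit.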

Changing machine SKUs is not really an option at the moment, and we are not quite ready to run with hot/warm nodes. But I am up for trying any other suggestion. If you think that more data nodes/smaller shards would help, I could do that easily enough. Or, even though it is the data nodes whose breakers are tripping, maybe adding more ingestion nodes would help somehow? Or if you think I can just reduce the jvm heap on those data nodes (though, before I gave them more heap, they were tripping more frequently).

I think my last resort will be turning off the setting to have parent circuit breakers consider real memory usage, but I'd really rather not do that. All of the other issues I've worked through to get the circuit breakers to stop tripping have been resolved. This is the only one that is left, and it seems like such a trivial one to have to give up on.

Sorry for the ugly formatting. Our current system generates jvm options on the fly at startup and passes them in as parameters.

-XX:+AlwaysPreTouch, -Xss1m, 
-Djna.nosys=true, -Djava.awt.headless=true, 

Ahhh. That looks very relevant. I did not realize that it was rejecting requests based on if they would trigger GC. Your suggestion to alter those GC settings sounds like a very reasonable solution.

That isn't the case. The real memory circuit breaker triggers when the JVM reports its heap usage is over 95% of the maximum. I don't think we have a way to tell whether something would trigger GC or not.
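As a quick sanity check of that 95% figure against the numbers reported earlier in this thread, the arithmetic can be sketched as follows (the function names are illustrative; the 62 GB heap and the real usage value are taken from the posts above):

```python
GIB = 1024 ** 3


def parent_breaker_limit(heap_bytes: int, percent: int = 95) -> int:
    """Parent breaker limit as a whole-number percentage of the max heap."""
    return heap_bytes * percent // 100


def headroom(heap_bytes: int, used_bytes: int, percent: int = 95) -> int:
    """Bytes left before the breaker trips (negative means already over)."""
    return parent_breaker_limit(heap_bytes, percent) - used_bytes


limit = parent_breaker_limit(62 * GIB)
print(limit)                             # limit for a 62 GB heap, in bytes
print(headroom(62 * GIB, 63725814080))   # real usage from the first exception
```

The computed limit for a 62 GB heap matches the limit of [63243393433/58.8gb] in the exception, and the reported real usage was already about 460 MB past it before the 290-byte request was counted.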

However, I think the crucial change to the default GC config is the below PR, and it looks like your config has not incorporated that change:

I.e. the recommendations are now as follows:


(It's possible that this isn't the only issue here, but it seems wise to try that as a first step)

It seems odd that we would hit these exceptions when your system is under light load, but I'm not at all a GC expert so maybe this is a consequence of the settings mentioned above. Just to check, have you configured things to permit more concurrent recoveries than the default of 2? I guess if there were hundreds of concurrent recoveries then that might itself result in overload, but I'm still a bit surprised.

Note that you can run more than one node on a single host. It could be that your hosts are powerful enough to support two nodes, each with a heap that's still within recommendations (i.e. small enough to reap the benefits of compressed oops). Something to consider if you can't get the heap size back down again with your current node count.

I've applied those new defaults, and am now monitoring our cluster to see what impact it has.

I agree, although we have our own monitoring for things like jvm heap usage over time, and I can tell you that when allowed to by the GC rules, the heap would regularly go up to the top of its bounds. Which, I guess, means the number that really matters is the distance between the point just before GC kicks in and the circuit breaker threshold. With enough nodes in the cluster, I guess that happens often enough. I'm not quite sure of the chain of events that would then take it over the circuit breaker threshold, but at least in our cluster, it does not seem to take much.

Moving that limit above 2 is something we have been doing with our clusters for several years now. Our clusters have tens of thousands of shards, and recovering 2 at a time seems like an unreasonably low number. In other clusters, I think we have raised that number as high as 20, while in this new one I only have it at 15, and for a while I had it at 10 (with no noticeable difference to the problem at hand).
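For readers following along, the limit being discussed is normally changed through the cluster settings API. The sketch below only builds the JSON body for PUT /_cluster/settings; it assumes the setting in question is cluster.routing.allocation.node_concurrent_recoveries (the one with a default of 2), and the function name is made up.

```python
import json


def concurrent_recoveries_body(limit: int, persistent: bool = False) -> str:
    """Return the JSON body for PUT /_cluster/settings with the new limit."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    scope = "persistent" if persistent else "transient"
    settings = {"cluster.routing.allocation.node_concurrent_recoveries": limit}
    return json.dumps({scope: settings})


print(concurrent_recoveries_body(15))
```

Lowering it back toward the default for a test run would be one way to rule concurrent recoveries in or out as a contributor to heap pressure.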

I do think you are right, and that is something I would like to do at some point. But with our current architecture, modifying it to run two nodes on one box is just difficult enough that it has not become a priority task. TBH, the fact that we are not fully utilizing these machines is not actually a pressing issue, but as I was experimenting with this new cluster I thought it would be worthwhile to see what giving the JVM a larger heap would do for us. I assumed that the extra memory would overcome any disadvantages from more expensive GC operations.

On that note, I am not quite understanding how more nodes would help our data nodes not trip their circuit breakers (our ingestion nodes are already at the point where they do not trip circuit breakers any more). As I understand it, our shard sizes and counts per node are both far under the recommended limits. If I were to add more nodes, I can't imagine I would want to decrease the shard sizes even further. It would naturally decrease the number of shards per node, but if having 100 5 GB shards per node is too much for a 7.5 cluster, then I feel like it's a backwards step from 6.x.

My main concern is that if the heap is too large to make use of compressed oops then all ordinary object pointers grow to twice the size, and there are a lot of these pointers in play. In practical terms two JVMs both with 30GB heaps are, in total, significantly larger than a single JVM with a 60GB heap.

Let's wait to see the effects of the change to the GC settings, however, because I think they will help enormously and will hopefully let you reduce your nodes' heaps back down anyway.

I share that concern. Though, currently my memory problems are on my data nodes, and if I just added more I am not sure how it would give me room to reduce memory usage unless I also reduced the shard sizes (unless I am relying on each node having fewer of these shards, but their shard counts are low enough that I would not expect it to be a problem). It feels like there is something specific about our data that is causing a miscalculation of some sort. Even with fairly small shards, and not that many of them, we still see circuit breakers randomly tripped.

I ran a test over the weekend with the updated GC settings, and it did seem to perform much better. However, this morning I saw a circuit breaker trip (just one, which is far fewer than the dozen or so I would see before I made those changes), and a shard get stuck unassigned because of it:

nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [65273877818/60.7gb], which is larger than the limit of [63243393433/58.8gb], real usage: [65273876872/60.7gb], new bytes reserved: [946/946b], usages [request=0/0b, fielddata=2454646/2.3mb, in_flight_requests=1390/1.3kb, accounting=636904596/607.3mb]]; ], allocation_status[no_attempt]]]

Also, I'll say that the JVM memory usage patterns for my data nodes are not what I expected. On my ingestion nodes the change had a big effect. I updated the GC settings, and those nodes stopped ever going over ~30% JVM Memory Usage. However, on my data nodes, the average ceiling for JVM memory usage only went down from ~85% (with peaks over 90%) to ~70% (with peaks over 85%). And for some reason, my master nodes do not really seem to have changed much from how they acted before; I am still seeing GC happen around ~65% for the active master.

Does this mean that my Data nodes are being given such big tasks that it is still enough on occasion for one big job to push it over the circuit breaker threshold? It is hard to imagine that the same workload which we have running in parallel on a 6.4 cluster (in that cluster we still have only 30 GB of memory, and actually have a heavier workload) uses that much more memory in 7.5.

What can I do to reduce the max percentage of JVM heap these nodes will use? It is hard to imagine that this is a problem created by them having too much memory available, but if you think that might be it, I could certainly revert them to 30 GB. It is counter-intuitive to me to reduce shard sizes, or add more nodes at this point, but I would also be willing to try either if you think it might help. Perhaps Data nodes are different from other node types, and they are more prone to expand to use up as much JVM heap as possible?

Above, DeeeFox had suggested altering two settings that you had not mentioned: XX:MaxGCPauseMillis and XX:GCPauseIntervalMillis. Would you recommend that I give further GC tuning a try, or is there another approach we could take?

As I said, those changes do appear to have made tripped circuit breakers less common. Though that is a bit of a double-edged sword, since this last time it took almost 3 days for a breaker to be tripped. So it is getting harder to verify.
