Why move shard to unassigned when the circuit breaker is open?

I find that the replica shard is moved to unassigned, when write to replica and trip to circuit-brek if i used Circuit-break based on real memory usage.

This will result in many started shards being moved back to the unallocated queue. Due to insufficient memory, they will no longer be restarted.

In addition, if some specified exceptions**(eg. ShardNotFoundException、IndexNotFoundException...)** cause the copy write failure, it seems that no request**(internal:cluster/shard/failure)** will be sent to the master node.

Why does this exception**(CircuitBreakingException)** not need to be handled specially? Instead, it moves the copy to the unallocated state?

image

image

Hi @cigarl,

it could be good to know the specific version of Elasticsearch you are on. Also, could you include the classnames of the code snippets (or github links), I am having trouble guessing at the exact pieces of code you are referring to.

The circuit breaker exception will need to fail the shard, since if we did not fail the shard, the replica would be missing an operation. We could retry, but this could escalate the memory issue to other nodes if we are not careful.

Other users have reported circuit breaker issues if using G1 GC with real memory circuit breaker. If you are using G1 GC, please double check that your jvm settings look similar to what is in this pull request: https://github.com/elastic/elasticsearch/pull/46169

Thank you for your answer. This part has answered my doubts.

Because I don't require consistency of replica data, I think that I can try to modify this part of logic by myself.

Thanks again

This sounds risky and will make it harder to get help later on. I would recommend try tuning the cluster to avoid the circuit reakers triggering instead if possible.

Modifying that part of the code to simply ignore the exception will not work out, since the global checkpoint will no longer be able to progress, causing a myriad of other issues (memory usage, incompatibility with other components and more).

I would recommend to look into the cause of the circuit breaker exceptions by looking into GC, request sizes etc. You can also consider scaling your deployment either horizontally or vertically. Circuit breakers are there to avoid worse problems, but they still signal an overload situation.

1 Like

Thank you for your reminding.

At present, I am still in the verification stage, not necessarily.

Thank you for your reply.

Now I'm just testing, not necessarily.

Maybe I should think about that why memory is used so much. Segment memory takes up a lot. But according to my experience, this problem seems to be abnormal, because my segment files are not large.

The parameters of GC remain unchanged in several environments, but not all environments have the problem of high segment memory.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.