Why move shard to unassigned when the circuit breaker is open?

cigarl · December 17, 2019, 3:26am

I find that the replica shard is moved to unassigned, when write to replica and trip to circuit-brek if i used Circuit-break based on real memory usage.

This will result in many started shards being moved back to the unallocated queue. Due to insufficient memory, they will no longer be restarted.

In addition, if some specified exceptions**(eg. ShardNotFoundException、IndexNotFoundException...)** cause the copy write failure, it seems that no request**(internal:cluster/shard/failure)** will be sent to the master node.

Why does this exception**(CircuitBreakingException)** not need to be handled specially? Instead, it moves the copy to the unallocated state?

HenningAndersen · December 17, 2019, 12:44pm

Hi @cigarl,

it could be good to know the specific version of Elasticsearch you are on. Also, could you include the classnames of the code snippets (or github links), I am having trouble guessing at the exact pieces of code you are referring to.

The circuit breaker exception will need to fail the shard, since if we did not fail the shard, the replica would be missing an operation. We could retry, but this could escalate the memory issue to other nodes if we are not careful.

Other users have reported circuit breaker issues if using G1 GC with real memory circuit breaker. If you are using G1 GC, please double check that your jvm settings look similar to what is in this pull request: https://github.com/elastic/elasticsearch/pull/46169

cigarl · December 18, 2019, 3:14am

Thank you for your answer. This part has answered my doubts.

Because I don't require consistency of replica data, I think that I can try to modify this part of logic by myself.

Thanks again！

Christian_Dahlqvist · December 18, 2019, 6:53am

This sounds risky and will make it harder to get help later on. I would recommend try tuning the cluster to avoid the circuit reakers triggering instead if possible.

HenningAndersen · December 18, 2019, 11:30am

Modifying that part of the code to simply ignore the exception will not work out, since the global checkpoint will no longer be able to progress, causing a myriad of other issues (memory usage, incompatibility with other components and more).

I would recommend to look into the cause of the circuit breaker exceptions by looking into GC, request sizes etc. You can also consider scaling your deployment either horizontally or vertically. Circuit breakers are there to avoid worse problems, but they still signal an overload situation.

cigarl · December 19, 2019, 1:50am

Thank you for your reminding.

At present, I am still in the verification stage, not necessarily.

cigarl · December 19, 2019, 2:04am

Thank you for your reply.

Now I'm just testing, not necessarily.

Maybe I should think about that why memory is used so much. Segment memory takes up a lot. But according to my experience, this problem seems to be abnormal, because my segment files are not large.

The parameters of GC remain unchanged in several environments, but not all environments have the problem of high segment memory.

system · January 16, 2020, 2:04am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Circuit Breaker Tripping During Shard Relocation Elasticsearch	1	913	April 24, 2018
ES 7.0.1 : Unassigned Shards : Clarifications on how reroute API with retry_failed parameter works and its side-effects Elasticsearch	6	1528	March 20, 2020
Circuit_breaking_exception during reindex Elasticsearch	22	4902	November 20, 2019
[7.0.1] New Circuitbreaker - Segment bigger than heap. Should this break? Elasticsearch	10	1292	September 27, 2019
Shard has exceeded the maximum number of retries Elasticsearch	13	11823	February 24, 2020

Why move shard to unassigned when the circuit breaker is open?

Related topics