CCR request circuit breaker errors

Hi,

We are using the CCR feature as part of our process to migrate data between two clusters, but we are experiencing some unexpected behaviour in the remote cluster.

Once we begin to follow an index, the remote cluster reports this warning:

{"type": "server", "timestamp": "2021-12-23T10:27:41,495Z", "level": "WARN", "component": "o.e.i.b.request", "cluster.name": "remote-cluster", "node.name": "node-1", "message": "[request] New used memory 7689662394 [7.1gb] for data of [<reduce_aggs>] would be larger than configured breaker: 6442450944 [6gb], breaking", "cluster.uuid": "6s-6ymZ5SZqzc8KdmOqn9H", "node.id": "i-5Ilv6VJOSg6zWrm0IjVT"  }
{"type": "server", "timestamp": "2021-12-23T10:28:41,564Z", "level": "WARN", "component": "o.e.i.b.request", "cluster.name": "remote-cluster", "node.name": "node-1", "message": "[request] New used memory 9933680952 [9.2gb] for data of [preallocate[aggregations]] would be larger than configured breaker: 6442450944 [6gb], breaking", "cluster.uuid": "6s-6ymZ5SZqzc8KdmOqn9H", "node.id": "i-5Ilv6VJOSg6zWrm0IjVT"  }
{"type": "server", "timestamp": "2021-12-23T10:29:41,603Z", "level": "WARN", "component": "o.e.i.b.request", "cluster.name": "remote-cluster", "node.name": "node-1", "message": "[request] New used memory 12233179472 [11.3gb] for data of [preallocate[aggregations]] would be larger than configured breaker: 6442450944 [6gb], breaking", "cluster.uuid": "6s-6ymZ5SZqzc8KdmOqn9H", "node.id": "i-5Ilv6VJOSg6zWrm0IjVT"  }
{"type": "server", "timestamp": "2021-12-23T10:30:41,615Z", "level": "WARN", "component": "o.e.i.b.request", "cluster.name": "remote-cluster", "node.name": "node-1", "message": "[request] New used memory 14468545192 [13.4gb] for data of [preallocate[aggregations]] would be larger than configured breaker: 6442450944 [6gb], breaking", "cluster.uuid": "6s-6ymZ5SZqzc8KdmOqn9H", "node.id": "i-5Ilv6VJOSg6zWrm0IjVT"  }

And Kibana reports an error as well (screenshot not included).

My first idea was to modify some parameters in the follow request to reduce the impact, but after several tests the result was the same and the error persisted. I also tried setting parameters like max_read_request_operation_count, max_read_request_size, and max_outstanding_read_requests to minimum values to force a visible change, but nothing happened (I also tried the write properties of the follow API).
It looks like all the settings were ignored.
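
For reference, the kind of follow request we tested looked roughly like this (test_1, the cluster_a alias, and the values are only illustrative, not our exact request; the setting names come from the follow API):

PUT /test_1/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "cluster_a",
  "leader_index": "test_1",
  "max_read_request_operation_count": 512,
  "max_read_request_size": "1mb",
  "max_outstanding_read_requests": 1
}

The same throttling settings can also be applied to an already existing follower by pausing it and passing them to POST /test_1/_ccr/resume_follow.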

Our goal is to run CCR in the background with minimal impact on the remote cluster, but this issue is blocking us completely.

Some extra info:

GET _nodes/stats/jvm?pretty&human

"jvm" : {
        "mem" : {
          "heap_used" : "2.7gb",
          "heap_used_percent" : 34,
          "heap_committed" : "8gb",
          "heap_max" : "8gb"
          ....
       }
}
GET _nodes/stats/breaker

"request" : {
          "limit_size_in_bytes" : 6442450944,
          "limit_size" : "6gb",
          "estimated_size_in_bytes" : 42801814968,
          "estimated_size" : "39.8gb",
          "overhead" : 1.0,
          "tripped" : 47
        },...
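
For completeness, the limit those warnings refer to is the request circuit breaker (indices.breaker.request.limit, 60% of the JVM heap by default). It can be inspected, and adjusted dynamically, roughly like this (the 70% value is just an example; raising the limit would only mask the problem, so we have not done that):

// effective breaker settings, defaults included
GET _cluster/settings?include_defaults=true&flat_settings=true

// dynamic override (example value only)
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "70%"
  }
}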

I hope someone can suggest how to resolve this. Let me know if you need more info.

Kind Regards :wave:.

Hi @dgarcia ,

what version of Elasticsearch are you using? It sounds like you could be affected by the memory leak fixed here; it affects versions 7.12.0-7.14.0 (inclusive) and is fixed in 7.14.1 onwards.

Hi @HenningAndersen,

We are currently on Elasticsearch 7.14.0, so I will be upgrading to a later version. Thanks for the reply.

Kind Regards.

Hi,

Before finding this bug I was working on migrating the indices using CCR. To avoid the issue we mentioned before, I followed this process (the API calls behind the steps are sketched after the list):

  1. In the remote cluster, follow the index.
  2. Wait until the index "finishes", with its state changing from paused to active in the remote cluster (at this step the logs show the error we mentioned before).
  3. Free the memory on the nodes by restarting all affected nodes one by one, keeping the cluster status green.
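
A rough sketch of the API calls behind steps 1 and 2 (test_1 and the cluster_a alias are example names, not our exact commands):

// step 1: start following the leader index
PUT /test_1/_ccr/follow?wait_for_active_shards=1
{
  "remote_cluster": "cluster_a",
  "leader_index": "test_1"
}

// step 2: a follower moves between the paused and active states with these calls,
// and its progress can be watched with the follow stats
POST /test_1/_ccr/pause_follow
POST /test_1/_ccr/resume_follow
GET /test_1/_ccr/stats

Step 3 is an ordinary rolling restart of the affected nodes, done outside the API.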

At this point all the follower indices look like they are working properly, but we noticed that some indices don't have the same number of documents, or their pri.store.size is different.

GET _cat/indices?v=true&s=pri.store.size:desc

Remote cluster (Cluster B):
health status index  pri rep docs.count docs.deleted store.size pri.store.size
green  open   test_1   6   1  101246804     30397493    795.4gb        397.8gb
green  open   test_2   8   1   11614304      4254312     44.8gb         22.4gb

Local cluster (Cluster A):
health status index  pri rep docs.count docs.deleted store.size pri.store.size
green  open   test_1   6   1  101632577     27669048    778.9gb          388gb
green  open   test_2   8   1   12146329      3602574     46.1gb         22.8gb

Is this correct behaviour? Should Cluster B have exactly the same number of documents and the same sizes?

Maybe this is caused by the memory leak and will be fixed after upgrading to 7.14.1, but I would like to be sure whether CCR should produce the same values in both clusters.

Kind Regards.

Hi @dgarcia ,

on the surface those numbers look close enough. There is some uncertainty in the numbers with respect to concurrent indexing and refreshes (and maybe other aspects). Perhaps you can try:

POST test_1/_refresh
POST test_1/_count

to see if the numbers then match closely enough? I assume there is concurrent indexing going on in the leader.

Update about the issue:

After updating the cluster to 7.14.1 we are experiencing the same issues mentioned above. I tried to follow this index:

Cluster A (leader indices):

GET _cat/indices/?v&s=store.size:desc

health status index   uuid                    pri rep docs.count docs.deleted store.size pri.store.size
green  open  test_1   m5Ou2_EJJO6ojztPpzPLg   2   1   48941889     12241176    160.2gb         80.1gb

Cluster B (remote):

GET _cat/indices/?v&s=store.size:desc

health status index   uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   test_1  ggVOF2bTDOsrsaY5P_7XA   2   1   48941889     12369946    160.1gb           80gb

Kibana shows the same error posted before and the logs look similar.

Logs

{"type": "server", "timestamp": "2022-01-05T16:29:28,916Z", "level": "WARN", "component": "o.e.i.b.request", "cluster.name": "test", "node.name": "multipurpose-2", "message": "[request] New used memory 44057921760 [41gb] for data of [preallocate[aggregations]] would be larger than configured breaker: 5153960755 [4.7gb], breaking", "cluster.uuid": "6s-6ymZ5SZqzc8KdmOqn7A", "node.id": "4MsVPPiOS7i-4_B-pJlM6A"  }
{"type": "server", "timestamp": "2022-01-05T16:29:28,921Z", "level": "WARN", "component": "o.e.i.b.request", "cluster.name": "test", "node.name": "multipurpose-2", "message": "[request] New used memory 44057855849 [41gb] for data of [<reduce_aggs>] would be larger than configured breaker: 5153960755 [4.7gb], breaking", "cluster.uuid": "6s-6ymZ5SZqzc8KdmOqn7A", "node.id": "4MsVPPiOS7i-4_B-pJlM6A"  }
{"type": "server", "timestamp": "2022-01-05T16:30:28,932Z", "level": "WARN", "component": "o.e.i.b.request", "cluster.name": "test", "node.name": "multipurpose-2", "message": "[request] New used memory 44057905097 [41gb] for data of [<reduce_aggs>] would be larger than configured breaker: 5153960755 [4.7gb], breaking", "cluster.uuid": "6s-6ymZ5SZqzc8KdmOqn7A", "node.id": "4MsVPPiOS7i-4_B-pJlM6A"  }
{"type": "server", "timestamp": "2022-01-05T16:31:28,937Z", "level": "WARN", "component": "o.e.i.b.request", "cluster.name": "test", "node.name": "multipurpose-2.", "message": "[request] New used memory 44057921760 [41gb] for data of [preallocate[aggregations]] would be larger than configured breaker: 5153960755 [4.7gb], breaking", "cluster.uuid": "6s-6ymZ5SZqzc8KdmOqn7A", "node.id": "4MsVPPiOS7i-4_B-pJlM6A"  }
{"type": "server", "timestamp": "2022-01-05T16:31:28,942Z", "level": "WARN", "component": "o.e.i.b.request", "cluster.name": "test", "node.name": "multipurpose-2", "message": "[request] New used memory 44057855945 [41gb] for data of [<reduce_aggs>] would be larger than configured breaker: 5153960755 [4.7gb], breaking", "cluster.uuid": "6s-6ymZ5SZqzc8KdmOqn7A", "node.id": "4MsVPPiOS7i-4_B-pJlM6A"  }

Except for the warning logs and the Kibana error, the follower index looks to be working properly and the values are correct too.
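
In case it helps someone else, one way to double-check that a follower has caught up with its leader is to compare the checkpoints reported by the follow stats (the filter_path only trims the response; the field names are from the CCR stats API):

GET /test_1/_ccr/stats?filter_path=indices.shards.leader_global_checkpoint,indices.shards.follower_global_checkpoint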

Kind Regards.
