Hello,
I am currently testing CCR (cross-cluster replication) and staging a scenario in which my primary site is lost.
I need to script a procedure that changes all my followers to R/W and resumes operations on the secondary site.
To achieve this, I loop over all follower indices and run pause_follow, close, unfollow and open on each one.
During this test I found two different behaviours when unfollowing indices while the leader is not available:
- For some indices the unfollow call never returns (I waited for more than an hour), so I implemented a timeout for this case. The cluster shows the unfollow task as running, but it never finishes:
  curl --max-time 120 -X POST "secondary-cluster:9200/indice-2019.01/_ccr/unfollow?pretty"
  curl: (28) Operation timed out after 120001 milliseconds with 0 out of -1 bytes received
  and _cat/tasks:
  action                           task_id                     parent_task_id type      start_time    timestamp running_time ip            node
  indices:admin/xpack/ccr/unfollow HNTVVJ9-Qemfv9Vq4sImng:3976 -              transport 1582919983003 19:59:43  2.6d         192.168.1.YYY node1b
- The unfollow call returns immediately with an error:
  curl --max-time 120 -X POST "secondary-cluster:9200/indice-2019.01d/_ccr/unfollow?pretty"
  {
    "error" : {
      "root_cause" : [
        {
          "type" : "connect_transport_exception",
          "reason" : "[][192.168.1.XXX:9300] connect_exception"
        }
      ],
      "type" : "exception",
      "reason" : "ConnectTransportException[[][192.168.1.XXX:9300] connect_exception]; nested: AnnotatedConnectException[Connection refused: /192.168.1.XXX:9300]; nested: ConnectException[Connection refused];",
      "failed_to_remove_retention_leases" : "secondary-cluster/indice-2019.01d/WmQ9QlAgRtS8iqE1tXIY5Q-following-remote-prod/indice-2019.01d/X_dOSvl_Taadxq-yutWuaQ",
      "caused_by" : {
        "type" : "connect_transport_exception",
        "reason" : "[][192.168.1.XXX:9300] connect_exception",
        "caused_by" : {
          "type" : "annotated_connect_exception",
          "reason" : "Connection refused: /192.168.1.XXX:9300",
          "caused_by" : {
            "type" : "connect_exception",
            "reason" : "Connection refused"
          }
        }
      }
    },
    "status" : 500
  }
In both cases _ccr/stats shows no followers, and if I open the index, everything seems OK and ready for R/W operations.
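The per-index procedure described above could be sketched roughly as follows. This is an illustration only, not the exact script used in the test: the variables ES_HOST, MAX_TIME and DRY_RUN and the function names ccr_post and promote_follower are hypothetical, and the sketch assumes that a failed or timed-out unfollow should not prevent the index from being reopened, which matches what was observed above but may not be safe in general.

```shell
#!/usr/bin/env bash
# Sketch only: ES_HOST, MAX_TIME, DRY_RUN and both function names are
# illustrative assumptions, not part of Elasticsearch or of the original script.
ES_HOST="${ES_HOST:-secondary-cluster:9200}"
MAX_TIME="${MAX_TIME:-120}"

# Send one POST request to the follower cluster.
# With DRY_RUN=1 the request is only printed, not sent.
ccr_post() {
  local path="$1"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "POST ${ES_HOST}/${path}"
    return 0
  fi
  # curl exits non-zero on a timeout (exit code 28) or a connection failure.
  # An HTTP 500 response body is printed but does not change the exit code.
  curl --silent --show-error --max-time "$MAX_TIME" -X POST "${ES_HOST}/${path}"
}

# Convert one follower index into a regular read/write index.
promote_follower() {
  local index="$1"
  ccr_post "${index}/_ccr/pause_follow" || return 1
  ccr_post "${index}/_close" || return 1
  # Assumption: when the leader is unreachable, unfollow may hang until the
  # timeout; log a warning and continue so the index is still reopened.
  ccr_post "${index}/_ccr/unfollow" \
    || echo "warning: unfollow did not complete for ${index}" >&2
  ccr_post "${index}/_open"
}
```

It could then be called in a loop, for example `for idx in indice-2019.01 indice-2019.01d; do promote_follower "$idx"; done`, with DRY_RUN=1 set first to check the generated requests without touching the cluster.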
Should I be concerned about these long-running tasks? Is this the expected behaviour when unfollowing while the leader is not available?
ES Version: 7.4.2 (tgz distribution)
Thanks for your help.
Regards.