Unfollowing indices when leader is gone

Hello,

I am currently testing CCR and staging a scenario in which my primary site is lost.

I need to script a procedure that switches all my followers to R/W and resumes operations on the secondary site.
To achieve this, I loop over all followers, running pause_follow/close/unfollow/open on each one.
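
For readers who want to script the same loop, here is a minimal Python sketch of the pause_follow/close/unfollow/open sequence described above. It is an illustration, not the script used in this test: `BASE_URL`, `http_post` and `promote_follower` are hypothetical names, and the choice to carry on to `_open` after a failed or timed-out unfollow is an assumption based on the observation in this post that the index could still be opened afterwards.

```python
import json
import urllib.error
import urllib.request

# Assumption: follower cluster address, taken from the curl examples in this thread.
BASE_URL = "http://secondary-cluster:9200"

# Order matters: pause replication, close the index, unfollow, then reopen.
STEPS = ("_ccr/pause_follow", "_close", "_ccr/unfollow", "_open")


def http_post(url, timeout):
    """POST with an empty body; return (status, body text).

    Returns (None, None) on connection errors or timeouts. Note that the
    urllib timeout applies to each blocking socket operation, so it is not
    a strict cap on total time the way curl --max-time is.
    """
    req = urllib.request.Request(url, data=b"", method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # Elasticsearch error responses (for example the 500 above) land here.
        return err.code, err.read().decode("utf-8", "replace")
    except OSError:
        # Covers URLError, refused connections and socket timeouts.
        return None, None


def promote_follower(index, post=http_post, timeout=120):
    """Run the four steps for one index; return a list of (step, status)."""
    results = []
    for step in STEPS:
        status, _body = post(f"{BASE_URL}/{index}/{step}", timeout)
        results.append((step, status))
        # Assumption: only an unfollow failure is tolerated; any other
        # failed step stops the sequence so the index can be inspected.
        if step != "_ccr/unfollow" and status != 200:
            break
    return results


# Hypothetical usage, one call per follower index:
# for name in ["indice-2019.01", "indice-2019.01d"]:
#     print(name, promote_follower(name))
```

Passing the HTTP call in as the `post` argument keeps the sequencing logic separate from the network, so it can be exercised without a cluster.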

During this test I found two different behaviours when unfollowing indices when the leader is not available:

  1. For some indices the unfollow call never returns (I waited for more than an hour), so I implemented a timeout for this case.
    The cluster shows the unfollow task as running, and it never finishes.

     curl --max-time 120  -X POST secondary-cluster:9200/indice-2019.01/_ccr/unfollow?pretty 
     curl: (28) Operation timed out after 120001 milliseconds with 0 out of -1 bytes received
    

    and _cat/tasks:

     action                           task_id                     parent_task_id type      start_time    timestamp running_time ip            node
     indices:admin/xpack/ccr/unfollow HNTVVJ9-Qemfv9Vq4sImng:3976 -              transport 1582919983003 19:59:43  2.6d         192.168.1.YYY node1b

  2. The unfollow call returns immediately with an error

     curl --max-time 120  -X POST "secondary-cluster:9200/indice-2019.01d/_ccr/unfollow?pretty" 
     {
       "error" : {
         "root_cause" : [
           {
             "type" : "connect_transport_exception",
             "reason" : "[][192.168.1.XXX:9300] connect_exception"
           }
         ],
         "type" : "exception",
         "reason" : "ConnectTransportException[[][192.168.1.XXX:9300] connect_exception]; nested: AnnotatedConnectException[Connection refused: /192.168.1.XXX:9300]; nested: ConnectException[Connection refused];",
         "failed_to_remove_retention_leases" : "secondary-cluster/indice-2019.01d/WmQ9QlAgRtS8iqE1tXIY5Q-following-remote-prod/indice-2019.01d/X_dOSvl_Taadxq-yutWuaQ",
         "caused_by" : {
           "type" : "connect_transport_exception",
           "reason" : "[][192.168.1.XXX:9300] connect_exception",
           "caused_by" : {
             "type" : "annotated_connect_exception",
             "reason" : "Connection refused: /192.168.1.XXX:9300",
             "caused_by" : {
               "type" : "connect_exception",
               "reason" : "Connection refused"
             }
           }
         }
       },
       "status" : 500
     }
    

In both cases _ccr/stats shows no followers, and if I open the index everything seems OK and ready for R/W operations.

Should I be concerned about these long-running tasks? Is this the expected behaviour when unfollowing while the leader is unavailable?
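
As a way to keep an eye on such tasks, the sketch below filters the JSON returned by the task management API (for example `GET _tasks?actions=indices:admin/xpack/ccr/unfollow`) for tasks that have been running longer than a threshold. The function name `lingering_tasks` and the threshold are hypothetical; the field names (`nodes`, `tasks`, `action`, `running_time_in_nanos`) follow the usual shape of that API's response, so treat this as an assumption to check against your own cluster's output.

```python
def lingering_tasks(tasks_response, action, min_seconds):
    """Return (task_id, node_name, seconds) for matching long-running tasks.

    tasks_response is the parsed JSON body of a GET _tasks call.
    """
    found = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            seconds = task.get("running_time_in_nanos", 0) / 1e9
            if task.get("action") == action and seconds >= min_seconds:
                found.append((task_id, node.get("name"), seconds))
    return found


# Hypothetical usage, with `body` holding the parsed _tasks response:
# stuck = lingering_tasks(body, "indices:admin/xpack/ccr/unfollow", 3600)
```

A task like the one in the _cat/tasks output above (running for 2.6d) would be reported with any threshold of an hour or so.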

ES Version: 7.4.2 (tgz distribution)

Thanks for your help.
Regards.

Hmm. I can imagine things that unfollowing might be waiting for (in vain), but I think you're right and it shouldn't be. Any chance you can try to reproduce this on 7.6, just in case it's something that's been fixed? If it persists in the latest version, would you open an issue on GitHub about this?

Thanks for your reply,

I reproduced this behaviour in version 7.6.0 and also realized that the unfollow task hangs only for indices with more than one primary shard. With one primary shard it returns the exception.

I will run a few more tests and open a ticket.

For reference:

Open ticket: https://github.com/elastic/elasticsearch/issues/53174
Closed in: https://github.com/elastic/elasticsearch/pull/53262

Possibly one of the best-written bug reports I've ever read, thanks @javierE :slight_smile:
