NodeClosedException while auto-following indices

Hi!

We have two Elasticsearch environments up and running with cross-cluster replication (CCR) configured.
Everything has worked fine so far, but after checking document counts we noticed a mismatch. This only seems to affect "bigger" indices.
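For context, the comparison itself was simple: we fetched `GET /<index>/_count` from each cluster and diffed the totals. A minimal sketch of that diff step (the index names and counts here are made up for illustration):

```python
# Hypothetical per-index document counts, as returned by
# GET /<index>/_count on the leader and follower clusters.
leader_counts = {"visit-w2019.30": 150_000, "visit-w2019.29": 120_000}
follower_counts = {"ccr-visit-w2019.30": 148_500, "ccr-visit-w2019.29": 120_000}

def find_mismatches(leader, follower, prefix="ccr-"):
    """Pair each leader index with its prefixed follower index and
    report those whose document counts differ."""
    mismatches = {}
    for index, count in leader.items():
        follower_count = follower.get(prefix + index)
        if follower_count is not None and follower_count != count:
            mismatches[index] = (count, follower_count)
    return mismatches

print(find_mismatches(leader_counts, follower_counts))
# → {'visit-w2019.30': (150000, 148500)}
```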
Digging deeper, we found that

GET _ccr/stats

returns

...
"recent_auto_follow_errors" : [
      {
        "leader_index" : "AutoFollowerPatterns",
        "timestamp" : 1564653416136,
        "auto_follow_exception" : {
          "type" : "exception",
          "reason" : "RemoteTransportException[[10.176.39.38][10.176.39.38:9300][cluster:monitor/state]]; nested: NodeClosedException[node closed {10.176.39.38}{wOe54CAlTKW47GD473DCoA}{qjM8vEWrS5ybm0D2bB8Iaw}{10.176.39.38}{10.176.39.38:9300}{rack_id=us-west-2c, xpack.installed=true, zone=us-west-2}];",
          "caused_by" : {
            "type" : "node_closed_exception",
            "reason" : "node closed {10.176.39.38}{wOe54CAlTKW47GD473DCoA}{qjM8vEWrS5ybm0D2bB8Iaw}{10.176.39.38}{10.176.39.38:9300}{rack_id=us-west-2c, xpack.installed=true, zone=us-west-2}"
          }
        }
      }
    ]
...

In the Elasticsearch logs we see the following entry:

[2019-08-01T09:56:56,590][WARN ][o.e.x.c.a.ShardFollowTasksExecutor] [10.176.68.180] [ccr-visit-global-standard-w2019.30-1024][1] background management of retention lease [elastic-useast/ccr-visit-global-standard-w2019.30-1024/FkRtqtL-Q1aujddl0UvzxA-following-elastic-uswest/visit-global-standard-w2019.30-1024/7vymksCxRdeqkz-ThYACcg] failed while following
org.elasticsearch.transport.NodeNotConnectedException: [10.176.39.38][10.176.39.38:9300] Node not connected
        at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:151) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:557) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:516) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:251) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.start(TransportSingleShardAction.java:209) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:100) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:62) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.TransportAction.doExecute(TransportAction.java:143) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) ~[elasticsearch-6.8.1.jar:6.8.1]
...

I'm not sure whether those two messages are related, but at first glance it looks like they are.

So my question is: how can I make sure that the missing data still gets replicated, and how can I prevent this in the first place?

Thanks in advance!

P.S. We use Elasticsearch 6.8.1.

Can you share the CCR stats for these indices, and how you compared the document counts that led to this conclusion?

Thanks for your quick response Jason!

I think I made a small mistake: I just realized that those indices are still receiving data, so the document count is slightly off because of that. Older indices, which no longer receive any data, match fine.

Nevertheless, I still see the auto-follow error mentioned above.

Here are the stats:

 {
        "index" : "ccr-visit-global-standard-w2019.30-1024",
        "shards" : [
          {
            "remote_cluster" : "elastic-uswest",
            "leader_index" : "visit-global-standard-w2019.30-1024",
            "follower_index" : "ccr-visit-global-standard-w2019.30-1024",
            "shard_id" : 0,
            "leader_global_checkpoint" : 514,
            "leader_max_seq_no" : 514,
            "follower_global_checkpoint" : 514,
            "follower_max_seq_no" : 514,
            "last_requested_seq_no" : 514,
            "outstanding_read_requests" : 0,
            "outstanding_write_requests" : 0,
            "write_buffer_operation_count" : 0,
            "write_buffer_size_in_bytes" : 0,
            "follower_mapping_version" : 0,
            "follower_settings_version" : 0,
            "total_read_time_millis" : 0,
            "total_read_remote_exec_time_millis" : 0,
            "successful_read_requests" : 0,
            "failed_read_requests" : 0,
            "operations_read" : 0,
            "bytes_read" : 0,
            "total_write_time_millis" : 0,
            "successful_write_requests" : 0,
            "failed_write_requests" : 0,
            "operations_written" : 0,
            "read_exceptions" : [ ],
            "time_since_last_read_millis" : -1,
            "fatal_exception" : {
              "type" : "master_not_discovered_exception",
              "reason" : null
            }
          },
          {
            "remote_cluster" : "elastic-uswest",
            "leader_index" : "visit-global-standard-w2019.30-1024",
            "follower_index" : "ccr-visit-global-standard-w2019.30-1024",
            "shard_id" : 1,
            "leader_global_checkpoint" : 541,
            "leader_max_seq_no" : 541,
            "follower_global_checkpoint" : 541,
            "follower_max_seq_no" : 541,
            "last_requested_seq_no" : 541,
            "outstanding_read_requests" : 0,
            "outstanding_write_requests" : 0,
            "write_buffer_operation_count" : 0,
            "write_buffer_size_in_bytes" : 0,
            "follower_mapping_version" : 0,
            "follower_settings_version" : 0,
            "total_read_time_millis" : 0,
            "total_read_remote_exec_time_millis" : 0,
            "successful_read_requests" : 0,
            "failed_read_requests" : 0,
            "operations_read" : 0,
            "bytes_read" : 0,
            "total_write_time_millis" : 0,
            "successful_write_requests" : 0,
            "failed_write_requests" : 0,
            "operations_written" : 0,
            "read_exceptions" : [ ],
            "time_since_last_read_millis" : -1,
            "fatal_exception" : {
              "type" : "master_not_discovered_exception",
              "reason" : null
            }
          },
          {
            "remote_cluster" : "elastic-uswest",
            "leader_index" : "visit-global-standard-w2019.30-1024",
            "follower_index" : "ccr-visit-global-standard-w2019.30-1024",
            "shard_id" : 2,
            "leader_global_checkpoint" : 533,
            "leader_max_seq_no" : 533,
            "follower_global_checkpoint" : 533,
            "follower_max_seq_no" : 533,
            "last_requested_seq_no" : 533,
            "outstanding_read_requests" : 0,
            "outstanding_write_requests" : 0,
            "write_buffer_operation_count" : 0,
            "write_buffer_size_in_bytes" : 0,
            "follower_mapping_version" : 0,
            "follower_settings_version" : 0,
            "total_read_time_millis" : 0,
            "total_read_remote_exec_time_millis" : 0,
            "successful_read_requests" : 0,
            "failed_read_requests" : 0,
            "operations_read" : 0,
            "bytes_read" : 0,
            "total_write_time_millis" : 0,
            "successful_write_requests" : 0,
            "failed_write_requests" : 0,
            "operations_written" : 0,
            "read_exceptions" : [ ],
            "time_since_last_read_millis" : -1,
            "fatal_exception" : {
              "type" : "master_not_discovered_exception",
              "reason" : null
            }
          }
        ]
      }
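A quick way to scan a follow-stats response like the one above for checkpoint lag and fatal exceptions (a sketch over an abbreviated copy of the response; in 6.8 the per-index stats come from `GET /<follower_index>/_ccr/stats`):

```python
import json

# Abbreviated follow stats for one index, shaped like the
# GET /<follower_index>/_ccr/stats response shown above.
stats = json.loads("""
{
  "index": "ccr-visit-global-standard-w2019.30-1024",
  "shards": [
    {"shard_id": 0, "leader_global_checkpoint": 514,
     "follower_global_checkpoint": 514,
     "fatal_exception": {"type": "master_not_discovered_exception", "reason": null}},
    {"shard_id": 1, "leader_global_checkpoint": 541,
     "follower_global_checkpoint": 541, "fatal_exception": null}
  ]
}
""")

for shard in stats["shards"]:
    # Lag of 0 means the follower has caught up to the leader's checkpoint.
    lag = shard["leader_global_checkpoint"] - shard["follower_global_checkpoint"]
    fatal = shard.get("fatal_exception")
    status = fatal["type"] if fatal else "ok"
    print(f"shard {shard['shard_id']}: lag={lag} status={status}")
```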

Okay, that's comforting.

It means that when the auto-follower tried to check whether there were new indices, the node on the leader side was not connected. The next time the auto-follower runs, it will try again; this happens on cluster state updates on the leader. We keep track of the most recent errors in the stats for diagnostic purposes. It shows here that the error occurred once, on 2019-08-01, so it appears that the auto-follower has been executing without issue since then.
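As a side note on reading those stats: the `timestamp` field in `recent_auto_follow_errors` is epoch milliseconds, so you can convert it to see exactly when the error happened; here it lines up with the log entry quoted earlier:

```python
from datetime import datetime, timezone

# Timestamp from the recent_auto_follow_errors entry above (epoch millis).
ts_millis = 1564653416136
when = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
print(when.strftime("%Y-%m-%d %H:%M:%S UTC"))
# → 2019-08-01 09:56:56 UTC
```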

Ah, so error messages that appear in `_ccr/stats` can be old errors/warnings that have already been resolved? Good to know :slight_smile:

Thanks for your help, Jason!

You're welcome, thanks for your interest in Elasticsearch and CCR.