NodeClosedException while auto-following indices

Hi!

We have two Elasticsearch environments up and running with cross-cluster replication (CCR) configured.
Everything has worked fine so far, but after checking document counts we noticed a mismatch. This only seems to affect "bigger" indices.
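For context, the comparison itself was simple: we fetched `GET /<index>/_count` from each cluster and diffed the totals. A minimal sketch of that diff step (the index names and counts here are made up for illustration):

```python
# Hypothetical per-index document counts, as returned by
# GET /<index>/_count on the leader and follower clusters.
leader_counts = {"visit-w2019.30": 150_000, "visit-w2019.29": 120_000}
follower_counts = {"ccr-visit-w2019.30": 148_500, "ccr-visit-w2019.29": 120_000}

def find_mismatches(leader, follower, prefix="ccr-"):
    """Pair each leader index with its prefixed follower index and
    report those whose document counts differ."""
    mismatches = {}
    for index, count in leader.items():
        follower_count = follower.get(prefix + index)
        if follower_count is not None and follower_count != count:
            mismatches[index] = (count, follower_count)
    return mismatches

print(find_mismatches(leader_counts, follower_counts))
# → {'visit-w2019.30': (150000, 148500)}
```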
Digging deeper, we found that

GET _ccr/stats

returns

...
"recent_auto_follow_errors" : [
      {
        "leader_index" : "AutoFollowerPatterns",
        "timestamp" : 1564653416136,
        "auto_follow_exception" : {
          "type" : "exception",
          "reason" : "RemoteTransportException[[10.176.39.38][10.176.39.38:9300][cluster:monitor/state]]; nested: NodeClosedException[node closed {10.176.39.38}{wOe54CAlTKW47GD473DCoA}{qjM8vEWrS5ybm0D2bB8Iaw}{10.176.39.38}{10.176.39.38:9300}{rack_id=us-west-2c, xpack.installed=true, zone=us-west-2}];",
          "caused_by" : {
            "type" : "node_closed_exception",
            "reason" : "node closed {10.176.39.38}{wOe54CAlTKW47GD473DCoA}{qjM8vEWrS5ybm0D2bB8Iaw}{10.176.39.38}{10.176.39.38:9300}{rack_id=us-west-2c, xpack.installed=true, zone=us-west-2}"
          }
        }
      }
    ]
...

In the Elasticsearch logs we see the following entry:

[2019-08-01T09:56:56,590][WARN ][o.e.x.c.a.ShardFollowTasksExecutor] [10.176.68.180] [ccr-visit-global-standard-w2019.30-1024][1] background management of retention lease [elastic-useast/ccr-visit-global-standard-w2019.30-1024/FkRtqtL-Q1aujddl0UvzxA-following-elastic-uswest/visit-global-standard-w2019.30-1024/7vymksCxRdeqkz-ThYACcg] failed while following
org.elasticsearch.transport.NodeNotConnectedException: [10.176.39.38][10.176.39.38:9300] Node not connected
        at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:151) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:557) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:516) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:251) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.start(TransportSingleShardAction.java:209) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:100) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:62) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.TransportAction.doExecute(TransportAction.java:143) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.8.1.jar:6.8.1]
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:81) ~[elasticsearch-6.8.1.jar:6.8.1]
...

I'm not sure whether those two messages are related, but at first glance it looks like they are.

So my question is: how can I make sure that the missing data still gets replicated, and how can I prevent this in the first place?

Thanks in advance!

P.S. We use Elasticsearch 6.8.1.

Can you share the CCR stats for these indices, and how you compared the document counts that led to this conclusion?

Thanks for your quick response Jason!

I think I made a small mistake: I just realized that those indices are still receiving data, so the document count is slightly off because of that. Older indices, which no longer receive any data, match fine.

Nevertheless, I still see the auto-follow error mentioned above.

Here are the stats:

 {
        "index" : "ccr-visit-global-standard-w2019.30-1024",
        "shards" : [
          {
            "remote_cluster" : "elastic-uswest",
            "leader_index" : "visit-global-standard-w2019.30-1024",
            "follower_index" : "ccr-visit-global-standard-w2019.30-1024",
            "shard_id" : 0,
            "leader_global_checkpoint" : 514,
            "leader_max_seq_no" : 514,
            "follower_global_checkpoint" : 514,
            "follower_max_seq_no" : 514,
            "last_requested_seq_no" : 514,
            "outstanding_read_requests" : 0,
            "outstanding_write_requests" : 0,
            "write_buffer_operation_count" : 0,
            "write_buffer_size_in_bytes" : 0,
            "follower_mapping_version" : 0,
            "follower_settings_version" : 0,
            "total_read_time_millis" : 0,
            "total_read_remote_exec_time_millis" : 0,
            "successful_read_requests" : 0,
            "failed_read_requests" : 0,
            "operations_read" : 0,
            "bytes_read" : 0,
            "total_write_time_millis" : 0,
            "successful_write_requests" : 0,
            "failed_write_requests" : 0,
            "operations_written" : 0,
            "read_exceptions" : [ ],
            "time_since_last_read_millis" : -1,
            "fatal_exception" : {
              "type" : "master_not_discovered_exception",
              "reason" : null
            }
          },
          {
            "remote_cluster" : "elastic-uswest",
            "leader_index" : "visit-global-standard-w2019.30-1024",
            "follower_index" : "ccr-visit-global-standard-w2019.30-1024",
            "shard_id" : 1,
            "leader_global_checkpoint" : 541,
            "leader_max_seq_no" : 541,
            "follower_global_checkpoint" : 541,
            "follower_max_seq_no" : 541,
            "last_requested_seq_no" : 541,
            "outstanding_read_requests" : 0,
            "outstanding_write_requests" : 0,
            "write_buffer_operation_count" : 0,
            "write_buffer_size_in_bytes" : 0,
            "follower_mapping_version" : 0,
            "follower_settings_version" : 0,
            "total_read_time_millis" : 0,
            "total_read_remote_exec_time_millis" : 0,
            "successful_read_requests" : 0,
            "failed_read_requests" : 0,
            "operations_read" : 0,
            "bytes_read" : 0,
            "total_write_time_millis" : 0,
            "successful_write_requests" : 0,
            "failed_write_requests" : 0,
            "operations_written" : 0,
            "read_exceptions" : [ ],
            "time_since_last_read_millis" : -1,
            "fatal_exception" : {
              "type" : "master_not_discovered_exception",
              "reason" : null
            }
          },
          {
            "remote_cluster" : "elastic-uswest",
            "leader_index" : "visit-global-standard-w2019.30-1024",
            "follower_index" : "ccr-visit-global-standard-w2019.30-1024",
            "shard_id" : 2,
            "leader_global_checkpoint" : 533,
            "leader_max_seq_no" : 533,
            "follower_global_checkpoint" : 533,
            "follower_max_seq_no" : 533,
            "last_requested_seq_no" : 533,
            "outstanding_read_requests" : 0,
            "outstanding_write_requests" : 0,
            "write_buffer_operation_count" : 0,
            "write_buffer_size_in_bytes" : 0,
            "follower_mapping_version" : 0,
            "follower_settings_version" : 0,
            "total_read_time_millis" : 0,
            "total_read_remote_exec_time_millis" : 0,
            "successful_read_requests" : 0,
            "failed_read_requests" : 0,
            "operations_read" : 0,
            "bytes_read" : 0,
            "total_write_time_millis" : 0,
            "successful_write_requests" : 0,
            "failed_write_requests" : 0,
            "operations_written" : 0,
            "read_exceptions" : [ ],
            "time_since_last_read_millis" : -1,
            "fatal_exception" : {
              "type" : "master_not_discovered_exception",
              "reason" : null
            }
          }
        ]
      }
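A quick way to scan a follow-stats response like the one above for checkpoint lag and fatal exceptions (a sketch over an abbreviated copy of the response; in 6.8 the per-index stats come from `GET /<follower_index>/_ccr/stats`):

```python
import json

# Abbreviated follow stats for one index, shaped like the
# GET /<follower_index>/_ccr/stats response shown above.
stats = json.loads("""
{
  "index": "ccr-visit-global-standard-w2019.30-1024",
  "shards": [
    {"shard_id": 0, "leader_global_checkpoint": 514,
     "follower_global_checkpoint": 514,
     "fatal_exception": {"type": "master_not_discovered_exception", "reason": null}},
    {"shard_id": 1, "leader_global_checkpoint": 541,
     "follower_global_checkpoint": 541, "fatal_exception": null}
  ]
}
""")

for shard in stats["shards"]:
    # Lag of 0 means the follower has caught up to the leader's checkpoint.
    lag = shard["leader_global_checkpoint"] - shard["follower_global_checkpoint"]
    fatal = shard.get("fatal_exception")
    status = fatal["type"] if fatal else "ok"
    print(f"shard {shard['shard_id']}: lag={lag} status={status}")
```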

Okay, that's comforting.

It means that when the auto-follower tried to check whether there were new indices, the node on the leader side was not connected. The next time the auto-follower runs, it will try again; this happens on cluster state updates on the leader. We keep track of the most recent errors in the stats for diagnostic purposes. It shows here that the error occurred once, on 2019-08-01, so it appears that the auto-follower has been executing without issue since then.
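As a side note on reading those stats: the `timestamp` field in `recent_auto_follow_errors` is epoch milliseconds, so you can convert it to see exactly when the error happened; here it lines up with the log entry quoted earlier:

```python
from datetime import datetime, timezone

# Timestamp from the recent_auto_follow_errors entry above (epoch millis).
ts_millis = 1564653416136
when = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
print(when.strftime("%Y-%m-%d %H:%M:%S UTC"))
# → 2019-08-01 09:56:56 UTC
```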

Ah, so error messages that appear in `_ccr/stats` can be old errors/warnings that have already been resolved? Good to know :slight_smile:

Thanks for your help, Jason!

You're welcome, thanks for your interest in Elasticsearch and CCR.