Elasticsearch Cross Cluster Replication connect_timeout

We are running two Elasticsearch clusters on version 6.8.1: the first in our production environment and the second in our development environment. We're attempting to replicate the data from the production site to our development site so we can use it in testing for an application that will be querying Elasticsearch.

We've opened port 9300 between these two environments, and I can see in the Kibana GUI that the production cluster shows as connected under Remote Clusters. However, when I create a follower index to test replication from production to development, the shards on the development cluster fail to allocate with a connect_timeout:
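For reference, the setup we followed looks roughly like this in the Kibana Dev Tools console on the dev cluster. The remote alias `production`, the seed host `PROD_HOST`, and the follower index name are placeholders for our actual values; `builds-20200410` is the leader index from the log below:

```
# Register the production cluster as a remote on the dev cluster
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "production": {
          "seeds": ["PROD_HOST:9300"]
        }
      }
    }
  }
}

# Verify the remote shows as connected
GET _remote/info

# Create the follower index on the dev cluster
PUT /builds-20200410-follower/_ccr/follow
{
  "remote_cluster": "production",
  "leader_index": "builds-20200410"
}
```

Note that `GET _remote/info` (and the Kibana Remote Clusters page) only proves the node serving that request can reach production; the node recovering each follower shard also has to connect out on 9300.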

[2020-04-13T15:56:47,245][WARN ][o.e.c.r.a.AllocationService] [devMaster] failing shard [failed shard, shard [builds-20200410][3], node[fwXUzBLmQDWEAjTLPJCVCw], [P], recovery_source[snapshot recovery [GlgFqTxrRIGxpWIldwRPdg] from _ccr_production:_latest_/_latest_], s[INITIALIZING], a[id=3FBNINKYSk2eXzVb-dw5ww], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-04-13T15:55:16.414Z], failed_attempts[4], failed_nodes[[jItPymr0QzifCI9Km3UkOg, Zpl7miC4T_SFsm2RRBUi1A, bxgXoKIBTpKYyEADXfCFTg, fwXUzBLmQDWEAjTLPJCVCw]], delayed=false, details[failed shard on node [Zpl7miC4T_SFsm2RRBUi1A]: failed recovery, failure RecoveryFailedException[[builds-20200410][3]: Recovery failed on {DevMaster2}{Zpl7miC4T_SFsm2RRBUi1A}{xE00h9O8Sbq7PMuUK1rjqw}{DevMaster2}{IP:9300}{dilm}{ml.machine_memory=135020195840, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: ConnectTransportException[[][IP:9300] connect_timeout[30s]]; ], allocation_status[fetching_shard_data]], expected_shard_size[0], message [failed recovery], failure [RecoveryFailedException[[builds-20200410][3]: Recovery failed on {DevMaster3}{fwXUzBLmQDWEAjTLPJCVCw}{EJvLsrXsTX2pdZ2SvJsYFQ}{DevMaster3}{IP:9300}{dilm}{ml.machine_memory=67378692096, xpack.installed=true, ml.max_open_jobs=20}]; nested: IndexShardRecoveryException[failed recovery]; nested: IndexShardRestoreFailedException[restore failed]; nested: ConnectTransportException[[][IP:9300] connect_timeout[30s]]; ], markAsStale [true]]
org.elasticsearch.indices.recovery.RecoveryFailedException: [builds-20200410][3]: Recovery failed on {DevMaster3}{fwXUzBLmQDWEAjTLPJCVCw}{EJvLsrXsTX2pdZ2SvJsYFQ}{DevMaster3}{IP:9300}{dilm}{ml.machine_memory=67378692096, xpack.installed=true, ml.max_open_jobs=20}
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$17(IndexShard.java:2584) ~[elasticsearch-7.5.2.jar:7.5.2]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[elasticsearch-7.5.2.jar:7.5.2]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:353) ~[elasticsearch-7.5.2.jar:7.5.2]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:283) ~[elasticsearch-7.5.2.jar:7.5.2]
        at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1867) ~[elasticsearch-7.5.2.jar:7.5.2]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$17(IndexShard.java:2580) ~[elasticsearch-7.5.2.jar:7.5.2]
        ... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: restore failed
        at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:480) ~[elasticsearch-7.5.2.jar:7.5.2]
        at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$5(StoreRecovery.java:285) ~[elasticsearch-7.5.2.jar:7.5.2]
        at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:308) ~[elasticsearch-7.5.2.jar:7.5.2]
        at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:283) ~[elasticsearch-7.5.2.jar:7.5.2]
        at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1867) ~[elasticsearch-7.5.2.jar:7.5.2]
        at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$17(IndexShard.java:2580) ~[elasticsearch-7.5.2.jar:7.5.2]
        ... 4 more
Caused by: org.elasticsearch.transport.ConnectTransportException: [][IP:9300] connect_timeout[30s]
        at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onTimeout(TcpTransport.java:995) ~[elasticsearch-7.5.2.jar:7.5.2]
        ... 4 more
[2020-04-13T15:56:48,180][DEBUG][o.e.a.a.c.s.r.RestoreClusterStateListener] [DevMaster] restore of [_latest_/_latest_] completed

If you are running Elasticsearch 6.8.1 in both clusters, why does the stack trace indicate Elasticsearch 7.5.2?

Apologies for that. I forgot that we initially upgraded but had to downgrade our production cluster because the CloudBees Jenkins Elasticsearch plugin is not compatible with Elasticsearch 7.x. Would the dev instance being on 7.5.2 be the issue?

To clarify my previous incorrect statement: the production cluster is running 6.8.1, and we're attempting to replicate data to our development cluster running 7.5.2. If needed, I can downgrade the dev cluster.
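A quick way to confirm what each cluster is actually running is the root endpoint, which reports the node's version (run against a node in each cluster):

```
GET /
# The response includes a "version" object, e.g.
# "version": { "number": "6.8.1", ... }  on production
# "version": { "number": "7.5.2", ... }  on development
```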

Based on this table, it looks like that combination should be fine, so I'll need to leave this for someone with more experience.
