Cross-cluster search timing out; reconnects automatically after a random duration

Hi All,

Remote cluster search is enabled in my production environment.

It is working well; however, the error below is sometimes observed in the Elasticsearch logs.

org.elasticsearch.transport.RemoteTransportException: [error while communicating with remote cluster

After some time it automatically reconnects and remote search works properly again.

We did not observe any ping loss between the nodes.

We are not sure whether this is an Elasticsearch problem or a network problem. How can we debug it?

Is there any configuration for reducing the reconnect interval, or any other tuning that needs to be done?

As of now we have not configured any time intervals.
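In case it helps with debugging, this is roughly how the remote connection state could be watched while the error is happening, using the remote info and node transport stats APIs. This is only a rough sketch; the host, port, and absence of authentication are assumptions about the setup and need adjusting:

    # check_remote.py - poll the remote cluster connection state and transport stats.
    # Assumes a coordinating node on http://localhost:9200 with security disabled.
    import time
    import requests

    ES = "http://localhost:9200"

    while True:
        remote = requests.get(f"{ES}/_remote/info").json()
        stats = requests.get(f"{ES}/_nodes/stats/transport").json()

        for alias, info in remote.items():
            # num_nodes_connected dropping to 0 means the remote connection was lost
            print(alias, "connected:", info["connected"],
                  "nodes:", info["num_nodes_connected"])

        for node in stats["nodes"].values():
            # server_open is the number of open inbound transport connections
            print(node["name"], "open transport connections:",
                  node["transport"]["server_open"])

        time.sleep(10)

If num_nodes_connected for kca-cluster drops to 0 at the same moment the RemoteTransportException appears, that would confirm the transport connections themselves are being lost.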

Please let me know if any other information is required.

Elasticsearch version: 6.6.1

Thanks in advance,
Srirama

Could you please share the stack trace of that exception, which hopefully also contains its cause?

I did not get any notifications, hence I was unable to reply earlier.

Below are the errors observed:

[2019-09-04T15:30:01,952][WARN ][r.suppressed             ] [node-1] path: /stats_cafedemo_04_09_2019,kca-cluster:stats_cafedemo_04_09_2019/_search, params: {size=0, ignore_unavailable=true, index=stats_cafedemo_04_09_2019,kca-cluster:stats_cafedemo_04_09_2019}
org.elasticsearch.transport.RemoteTransportException: [error while communicating with remote cluster [kca-cluster]]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [node-4][10.237.92.110:7001][indices:admin/shards/search_shards] disconnected
[2019-09-04T15:30:34,962][WARN ][o.e.t.RemoteClusterConnection] [node-1] fetching nodes from external cluster [kca-cluster] failed
org.elasticsearch.transport.ConnectTransportException: [][10.237.92.107:7001] handshake_timeout[30s]
        at org.elasticsearch.transport.TransportHandshaker.lambda$sendHandshake$1(TransportHandshaker.java:77) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_211]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_211]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_211]

Below is one more that was observed:

[2019-09-04T15:36:56,378][WARN ][r.suppressed             ] [node-1] path: /cafedemo_04_09_2019,kca-cluster:cafedemo_04_09_2019/_search, params: {ignore_unavailable=true, index=cafedemo_04_09_2019,kca-cluster:cafedemo_04_09_2019}
org.elasticsearch.transport.RemoteTransportException: [error while communicating with remote cluster [kca-cluster]]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [node-3][10.237.92.109:7001][indices:admin/shards/search_shards] disconnected
[2019-09-04T15:36:56,378][WARN ][r.suppressed             ] [node-1] path: /cafedemo_04_09_2019,kca-cluster:cafedemo_04_09_2019/_search, params: {ignore_unavailable=true, index=cafedemo_04_09_2019,kca-cluster:cafedemo_04_09_2019}
org.elasticsearch.transport.RemoteTransportException: [error while communicating with remote cluster [kca-cluster]]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [node-3][10.237.92.109:7001][indices:admin/shards/search_shards] disconnected

A different error is also observed:

[2019-09-04T15:36:56,380][DEBUG][o.e.a.s.TransportSearchAction] [node-1] [cafedemo_04_09_2019][4], node[RQ50cjGaTJyDflmJqQKj5w], [R], s[STARTED], a[id=E1u1mUeOT0u5AwhX_DuzGw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[cafedemo_04_09_2019, kca-cluster:cafedemo_04_09_2019], indicesOptions=IndicesOptions[ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=40, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, source={"from":0,"size":900,"query":{"bool":{"must":[{"range":{"tt":{"from":1567540800000,"to":1567592791061,"include_lower":true,"include_upper":true,"boost":1.0}}},{"terms":{"msg_status":["0","1","2"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"aggregations":{"tt":{"histogram":{"field":"tt","interval":5.1991061E7,"offset":0.0,"order":{"_key":"desc"},"keyed":false,"min_doc_count":0},"aggregations":{"msg_status":{"terms":{"field":"msg_status","size":2147483647,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}}}}}}}] lastShard [true]
org.elasticsearch.transport.NodeDisconnectedException: [node-3][10.237.92.109:7001][indices:data/read/search[phase/query]] disconnected
[2019-09-04T15:36:56,380][DEBUG][o.e.a.s.TransportSearchAction] [node-1] [cafedemo_04_09_2019][4], node[RQ50cjGaTJyDflmJqQKj5w], [R], s[STARTED], a[id=E1u1mUeOT0u5AwhX_DuzGw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[cafedemo_04_09_2019, kca-cluster:cafedemo_04_09_2019], indicesOptions=IndicesOptions[ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=40, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, source={"from":0,"size":900,"query":{"bool":{"must":[{"range":{"tt":{"from":1567540800000,"to":1567592871066,"include_lower":true,"include_upper":true,"boost":1.0}}},{"terms":{"msg_status":["0","1","2"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"aggregations":{"tt":{"histogram":{"field":"tt","interval":5.2071066E7,"offset":0.0,"order":{"_key":"desc"},"keyed":false,"min_doc_count":0},"aggregations":{"msg_status":{"terms":{"field":"msg_status","size":2147483647,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}}}}}}}] lastShard [true]
org.elasticsearch.transport.NodeDisconnectedException: [node-3][10.237.92.109:7001][indices:data/read/search[phase/query]] disconnected

In all these scenarios, it recovers after some time.

From the stack traces it looks like there are connection problems with that remote cluster, which happen in different phases of a cross-cluster search request.
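One thing that can at least soften the impact while the connection is being re-established is marking the remote cluster as optional for cross-cluster search, so requests return results from the local cluster instead of failing outright. A minimal sketch; the host and the missing authentication are assumptions, and kca-cluster is the alias taken from your logs:

    # Mark the remote cluster as optional so a cross-cluster search skips it
    # (and returns local results) while it is disconnected, instead of failing.
    import requests

    ES = "http://localhost:9200"  # a node of the local cluster; adjust host/auth

    settings = {
        "persistent": {
            "cluster.remote.kca-cluster.skip_unavailable": True
        }
    }

    resp = requests.put(f"{ES}/_cluster/settings", json=settings)
    resp.raise_for_status()
    print(resp.json())

This does not fix the underlying disconnects, but searches no longer error out while the remote side is unreachable.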

OK, from the ping stats, no ping loss is observed.

How can we proceed?

Do we need to tune any Elasticsearch configuration?

One more thing (I forgot to mention): our clusters communicate through a firewall.

Hence we suspect the problem could be as explained in the link below, though we are not sure how to debug it.

Are there any other Elasticsearch configuration changes that need to be made apart from the ones mentioned above?
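If the firewall silently drops idle connections between the clusters, which would match the random disconnect/reconnect pattern, keeping the remote transport connections alive with periodic pings is the usual mitigation. A sketch of what could be tried; as far as I know the per-remote ping_schedule setting exists from 6.6, but the host, the authentication, and the 30s interval here are assumptions, so please verify against the docs for your exact version:

    # Send periodic application-level pings on the connections to the remote
    # cluster so a firewall's idle timeout does not silently drop them.
    # "kca-cluster" is the alias from the logs above; host/auth are placeholders.
    import requests

    ES = "http://localhost:9200"

    settings = {
        "persistent": {
            # keep the remote connections warm with a ping every 30 seconds
            "cluster.remote.kca-cluster.transport.ping_schedule": "30s"
        }
    }

    resp = requests.put(f"{ES}/_cluster/settings", json=settings)
    print(resp.json())

    # If the node rejects this as a non-dynamic setting on your version, the
    # node-level equivalent can go into elasticsearch.yml instead (static,
    # requires a restart):
    #
    #   transport.ping_schedule: 30s

TCP keepalives alone often do not help here, because the default Linux keepalive idle time (2 hours) is much longer than typical firewall idle timeouts.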
