Cross-cluster search timing out; reconnects automatically after a random duration

Hi All,

Remote cluster search is enabled in my production environment.

It is working well; however, the error below is sometimes observed in the Elasticsearch logs.

org.elasticsearch.transport.RemoteTransportException: [error while communicating with remote cluster

After some time it automatically reconnects and remote search works properly again.

We did not observe any ping loss between the nodes.

We are not sure whether this is an Elasticsearch problem or a network problem. How can we debug it?

Is there any configuration for reducing the reconnect interval, or any other tuning that needs to be done?

As of now we have not configured any time intervals.
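In case it helps with debugging, this is roughly how the remote connection state could be watched while the error is happening, using the remote info and node transport stats APIs. This is only a rough sketch; the host, port, and absence of authentication are assumptions about the setup and need adjusting:

    # check_remote.py - poll the remote cluster connection state and transport stats.
    # Assumes a coordinating node on http://localhost:9200 with security disabled.
    import time
    import requests

    ES = "http://localhost:9200"

    while True:
        remote = requests.get(f"{ES}/_remote/info").json()
        stats = requests.get(f"{ES}/_nodes/stats/transport").json()

        for alias, info in remote.items():
            # num_nodes_connected dropping to 0 means the remote connection was lost
            print(alias, "connected:", info["connected"],
                  "nodes:", info["num_nodes_connected"])

        for node in stats["nodes"].values():
            # server_open is the number of open inbound transport connections
            print(node["name"], "open transport connections:",
                  node["transport"]["server_open"])

        time.sleep(10)

If num_nodes_connected for kca-cluster drops to 0 at the same moment the RemoteTransportException appears, that would confirm the transport connections themselves are being lost.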

Please let me know if any other information is required.

Elasticsearch version: 6.6.1

Thanks in advance,
Srirama

Could you please share the stack trace of that exception, which hopefully also contains its cause?

I did not get any notifications, hence I was unable to reply earlier.

Below are the errors observed:

[2019-09-04T15:30:01,952][WARN ][r.suppressed             ] [node-1] path: /stats_cafedemo_04_09_2019,kca-cluster:stats_cafedemo_04_09_2019/_search, params: {size=0, ignore_unavailable=true, index=stats_cafedemo_04_09_2019,kca-cluster:stats_cafedemo_04_09_2019}
org.elasticsearch.transport.RemoteTransportException: [error while communicating with remote cluster [kca-cluster]]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [node-4][10.237.92.110:7001][indices:admin/shards/search_shards] disconnected
[2019-09-04T15:30:34,962][WARN ][o.e.t.RemoteClusterConnection] [node-1] fetching nodes from external cluster [kca-cluster] failed
org.elasticsearch.transport.ConnectTransportException: [][10.237.92.107:7001] handshake_timeout[30s]
        at org.elasticsearch.transport.TransportHandshaker.lambda$sendHandshake$1(TransportHandshaker.java:77) ~[elasticsearch-6.6.1.jar:6.6.1]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660) [elasticsearch-6.6.1.jar:6.6.1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_211]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_211]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_211]

Below is one more that was observed:

[2019-09-04T15:36:56,378][WARN ][r.suppressed             ] [node-1] path: /cafedemo_04_09_2019,kca-cluster:cafedemo_04_09_2019/_search, params: {ignore_unavailable=true, index=cafedemo_04_09_2019,kca-cluster:cafedemo_04_09_2019}
org.elasticsearch.transport.RemoteTransportException: [error while communicating with remote cluster [kca-cluster]]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [node-3][10.237.92.109:7001][indices:admin/shards/search_shards] disconnected
[2019-09-04T15:36:56,378][WARN ][r.suppressed             ] [node-1] path: /cafedemo_04_09_2019,kca-cluster:cafedemo_04_09_2019/_search, params: {ignore_unavailable=true, index=cafedemo_04_09_2019,kca-cluster:cafedemo_04_09_2019}
org.elasticsearch.transport.RemoteTransportException: [error while communicating with remote cluster [kca-cluster]]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [node-3][10.237.92.109:7001][indices:admin/shards/search_shards] disconnected

A different error is also observed:

[2019-09-04T15:36:56,380][DEBUG][o.e.a.s.TransportSearchAction] [node-1] [cafedemo_04_09_2019][4], node[RQ50cjGaTJyDflmJqQKj5w], [R], s[STARTED], a[id=E1u1mUeOT0u5AwhX_DuzGw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[cafedemo_04_09_2019, kca-cluster:cafedemo_04_09_2019], indicesOptions=IndicesOptions[ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=40, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, source={"from":0,"size":900,"query":{"bool":{"must":[{"range":{"tt":{"from":1567540800000,"to":1567592791061,"include_lower":true,"include_upper":true,"boost":1.0}}},{"terms":{"msg_status":["0","1","2"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"aggregations":{"tt":{"histogram":{"field":"tt","interval":5.1991061E7,"offset":0.0,"order":{"_key":"desc"},"keyed":false,"min_doc_count":0},"aggregations":{"msg_status":{"terms":{"field":"msg_status","size":2147483647,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}}}}}}}] lastShard [true]
org.elasticsearch.transport.NodeDisconnectedException: [node-3][10.237.92.109:7001][indices:data/read/search[phase/query]] disconnected
[2019-09-04T15:36:56,380][DEBUG][o.e.a.s.TransportSearchAction] [node-1] [cafedemo_04_09_2019][4], node[RQ50cjGaTJyDflmJqQKj5w], [R], s[STARTED], a[id=E1u1mUeOT0u5AwhX_DuzGw]: Failed to execute [SearchRequest{searchType=QUERY_THEN_FETCH, indices=[cafedemo_04_09_2019, kca-cluster:cafedemo_04_09_2019], indicesOptions=IndicesOptions[ignore_unavailable=true, allow_no_indices=true, expand_wildcards_open=true, expand_wildcards_closed=false, allow_aliases_to_multiple_indices=true, forbid_closed_indices=true, ignore_aliases=false, ignore_throttled=true], types=[], routing='null', preference='null', requestCache=null, scroll=null, maxConcurrentShardRequests=40, batchedReduceSize=512, preFilterShardSize=128, allowPartialSearchResults=true, source={"from":0,"size":900,"query":{"bool":{"must":[{"range":{"tt":{"from":1567540800000,"to":1567592871066,"include_lower":true,"include_upper":true,"boost":1.0}}},{"terms":{"msg_status":["0","1","2"],"boost":1.0}}],"adjust_pure_negative":true,"boost":1.0}},"aggregations":{"tt":{"histogram":{"field":"tt","interval":5.2071066E7,"offset":0.0,"order":{"_key":"desc"},"keyed":false,"min_doc_count":0},"aggregations":{"msg_status":{"terms":{"field":"msg_status","size":2147483647,"min_doc_count":1,"shard_min_doc_count":0,"show_term_doc_count_error":false,"order":[{"_count":"desc"},{"_key":"asc"}]}}}}}}}] lastShard [true]
org.elasticsearch.transport.NodeDisconnectedException: [node-3][10.237.92.109:7001][indices:data/read/search[phase/query]] disconnected

In all these scenarios, it recovers after some time.

From the stack traces it looks like there are connection problems with that remote cluster, which happen in different phases of a cross-cluster search request.
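One thing that can at least soften the impact while the connection is being re-established is marking the remote cluster as optional for cross-cluster search, so requests return results from the local cluster instead of failing outright. A minimal sketch; the host and the missing authentication are assumptions, and kca-cluster is the alias taken from your logs:

    # Mark the remote cluster as optional so a cross-cluster search skips it
    # (and returns local results) while it is disconnected, instead of failing.
    import requests

    ES = "http://localhost:9200"  # a node of the local cluster; adjust host/auth

    settings = {
        "persistent": {
            "cluster.remote.kca-cluster.skip_unavailable": True
        }
    }

    resp = requests.put(f"{ES}/_cluster/settings", json=settings)
    resp.raise_for_status()
    print(resp.json())

This does not fix the underlying disconnects, but searches no longer error out while the remote side is unreachable.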

OK, from the ping stats, no ping loss is observed.

How can we proceed?

Do we need to tune any Elasticsearch configuration?

One more thing (I forgot to mention): our clusters communicate through a firewall.

Hence we suspect the problem could be as explained in the link below, though we are not sure how to debug it.

Are there any other Elasticsearch configuration changes that need to be made apart from the ones mentioned above?
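If the firewall silently drops idle connections between the clusters, which would match the random disconnect/reconnect pattern, keeping the remote transport connections alive with periodic pings is the usual mitigation. A sketch of what could be tried; as far as I know the per-remote ping_schedule setting exists from 6.6, but the host, the authentication, and the 30s interval here are assumptions, so please verify against the docs for your exact version:

    # Send periodic application-level pings on the connections to the remote
    # cluster so a firewall's idle timeout does not silently drop them.
    # "kca-cluster" is the alias from the logs above; host/auth are placeholders.
    import requests

    ES = "http://localhost:9200"

    settings = {
        "persistent": {
            # keep the remote connections warm with a ping every 30 seconds
            "cluster.remote.kca-cluster.transport.ping_schedule": "30s"
        }
    }

    resp = requests.put(f"{ES}/_cluster/settings", json=settings)
    print(resp.json())

    # If the node rejects this as a non-dynamic setting on your version, the
    # node-level equivalent can go into elasticsearch.yml instead (static,
    # requires a restart):
    #
    #   transport.ping_schedule: 30s

TCP keepalives alone often do not help here, because the default Linux keepalive idle time (2 hours) is much longer than typical firewall idle timeouts.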
