ES version: 6.2.4
os: redhat
Environment: We have two clusters (A and B) in two different data centers. For searching, we use cross-cluster search (CCS): clients query cluster A, and cluster A fetches the relevant data from cluster B on our behalf. Under normal conditions everything works fine.
Issue: To test the availability of the two clusters, we sometimes cut the network between cluster A and cluster B (by shutting down the network device or with a firewall). We expect cluster A to detect the disconnection from cluster B, but sometimes it does not detect it promptly (within 1 minute).
Test: I tested CCS with two small clusters, using firewall/iptables rules to cut the network. The behavior seems to depend on the TCP keepalive configuration:
With the default TCP configuration (net.ipv4.tcp_keepalive_time=7200 net.ipv4.tcp_keepalive_intvl=75 net.ipv4.tcp_keepalive_probes=9), the cluster needs more than 10 minutes to detect the disconnection. During that time the search blocks until the client times out, and the client gets the exception below (exception 1). The strangest part is that there is no ERROR or WARN entry in the ES log.
With an updated TCP configuration (net.ipv4.tcp_keepalive_time=120 net.ipv4.tcp_keepalive_intvl=30 net.ipv4.tcp_keepalive_probes=2), the search still times out at first, but after 2~3 minutes the search response is fine (containing only data from cluster A, "_clusters":{"total":2,"successful":1,"skipped":1}).
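To make the two configurations easier to compare, here is a rough sketch of the usual Linux keepalive arithmetic. It assumes the connection is idle and has SO_KEEPALIVE enabled; the function name is made up for illustration. Connections with unacknowledged data in flight are governed by the retransmission timeout (net.ipv4.tcp_retries2) rather than keepalive, so this is only an estimate, not a statement about what ES does internally.

```python
def keepalive_detection_seconds(time_s, intvl_s, probes):
    """Estimate how long an idle TCP connection takes to be declared dead.

    Hypothetical helper: the kernel waits tcp_keepalive_time before the
    first probe, then sends one probe every tcp_keepalive_intvl seconds
    and gives up after tcp_keepalive_probes unanswered probes.
    """
    return time_s + intvl_s * probes


# Values taken from the two sysctl configurations in the report.
default_cfg = keepalive_detection_seconds(7200, 75, 9)
tuned_cfg = keepalive_detection_seconds(120, 30, 2)

print("default: %d s (about %.0f min)" % (default_cfg, default_cfg / 60))
print("tuned:   %d s (about %.0f min)" % (tuned_cfg, tuned_cfg / 60))
```

Under these assumptions the tuned configuration comes out at 180 seconds, in line with the 2~3 minutes observed, while the default comes out at 7875 seconds, which is at least consistent with "more than 10 minutes".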
My guess is that there is no heartbeat or ping between the clusters when CCS is in use, unlike regular transport connections, where "transport.ping_schedule" applies.
BTW: I am using the REST API to configure CCS:
{
  "persistent": {
    "search": {
      "remote": {
        "clusterA": {
          "skip_unavailable": "true",
          "seeds": [
            "X.X.X.X:9300"
          ]
        },
        "clusterB": {
          "skip_unavailable": "true",
          "seeds": [
            "X2.X2.X2.X2:9300"
          ]
        }
      }
    }
  },
  "transient": {}
}
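For anyone who wants to reproduce the setup, the same body can be generated programmatically. This is only a sketch: build_remote_settings is a hypothetical helper, not part of any ES client, and the seed addresses are the placeholders from the body above. The resulting JSON would then be sent with PUT _cluster/settings.

```python
import json


def build_remote_settings(remotes, skip_unavailable="true"):
    """Build a cluster settings body for the given remote clusters.

    Hypothetical helper: `remotes` maps a cluster alias to its list of
    seed addresses (host:transport_port).
    """
    remote = {
        alias: {"skip_unavailable": skip_unavailable, "seeds": list(seeds)}
        for alias, seeds in remotes.items()
    }
    return {"persistent": {"search": {"remote": remote}}, "transient": {}}


# Placeholder seeds, exactly as masked in the report.
body = build_remote_settings({
    "clusterA": ["X.X.X.X:9300"],
    "clusterB": ["X2.X2.X2.X2:9300"],
})
print(json.dumps(body, indent=2))
```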
exception 1:
java.lang.RuntimeException: java.io.IOException: listener timeout after waiting for [30000] ms
at jdk.nashorn.internal.runtime.ScriptRuntime.apply(ScriptRuntime.java:397) ~[nashorn.jar:?]
at jdk.nashorn.api.scripting.ScriptObjectMirror.callMember(ScriptObjectMirror.java:199) ~[nashorn.jar:?]
at jdk.nashorn.api.scripting.NashornScriptEngine.invokeImpl(NashornScriptEngine.java:383) ~[nashorn.jar:?]
at jdk.nashorn.api.scripting.NashornScriptEngine.invokeFunction(NashornScriptEngine.java:190) ~[nashorn.jar:?]
at com.htsc.iscs.service.alarm.impl.RangeWatcherImpl.alarmCheck(RangeWatcherImpl.java:152) [service-2.0.0-SNAPSHOT.jar!/:2.0.0-SNAPSHOT]
at com.htsc.iscs.watcher.WatcherManager.watch(WatcherManager.java:159) [classes!/:2.0.0-SNAPSHOT]
at com.htsc.iscs.watcher.WatcherManager.access$100(WatcherManager.java:36) [classes!/:2.0.0-SNAPSHOT]
at com.htsc.iscs.watcher.WatcherManager$1.run(WatcherManager.java:129) [classes!/:2.0.0-SNAPSHOT]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_101]
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_101]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_101]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_101]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_101]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_101]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_101]
Caused by: java.io.IOException: listener timeout after waiting for [30000] ms
at org.elasticsearch.client.RestClient$SyncResponseListener.get(RestClient.java:665) ~[elasticsearch-rest-client-6.2.4.jar!/:6.2.4]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:223) ~[elasticsearch-rest-client-6.2.4.jar!/:6.2.4]
at org.elasticsearch.client.RestClient.performRequest(RestClient.java:195) ~[elasticsearch-rest-client-6.2.4.jar!/:6.2.4]