Cross-cluster search stops working after some time (ES 6.1.1, Kibana 6.0)

I am testing cross-cluster search.
I have 2 clusters with 2 nodes each, but it only works right after I restart both clusters.
I can see results from both clusters in Kibana and in the console (through the search API).
I can also see the output of the _remote/info API.

The problem is that after several minutes (10-20 min), cross-cluster search stops working.

Kibana shows:
Error: Request Timeout after 120000ms
at http://192.168.2.240:562/bundles/kibana.bundle.js?v=16070:13:4431
at http://192.168.2.240:562/bundles/kibana.bundle.js?v=16070:13:4852
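
For context, the 120000 ms value corresponds to Kibana's elasticsearch.requestTimeout setting in kibana.yml; raising it only delays this error rather than fixing the disconnects. A quick check of the value in use, assuming the default package location for kibana.yml:

grep requestTimeout /etc/kibana/kibana.yml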

The _remote/info API shows this error after several minutes without a response:
curl 10.98.129.90:9203/_remote/info?pretty
{
  "error" : {
    "root_cause" : [
      {
        "type" : "node_disconnected_exception",
        "reason" : "[node-1][10.98.119.90:9300][cluster:monitor/nodes/info] disconnected"
      }
    ],
    "type" : "node_disconnected_exception",
    "reason" : "[node-1][10.98.119.90:9300][cluster:monitor/nodes/info] disconnected"
  },
  "status" : 500
}

node-1 is always up and responding to ping.
There is no firewall between the two clusters: the firewalld service is stopped on both, and all traffic is permitted between them.
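
A sketch of quick checks to back this up, run from a my-cluster-b host (nc and systemctl here are only example tools, not necessarily what was actually used):

systemctl status firewalld     # confirm the service really is stopped
nc -zv 10.98.119.90 9300       # transport port of node-1 on my-cluster-a
nc -zv 10.98.119.90 9301       # transport port of node-2 on my-cluster-a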

Does anybody know what is wrong?

Regards!

Clusters and nodes INFO:-----------------------------------------------------------------------
Clusters:
my-cluster-a: 10.98.119.90
node-1 (data only), JVM instance
node-2 (master-eligible, data), JVM instance

my-cluster-b: 10.98.129.90
node-3 (data only), JVM instance
node-4 (master-eligible, data), JVM instance

my-cluster-a is added as a remote cluster in my-cluster-b.
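
For reference, a minimal sketch of how that remote would be registered on my-cluster-b via the cluster settings API (this assumes the pre-6.5 search.remote.* setting prefix; the alias and seeds are taken from the _remote/info output below):

curl -XPUT '10.98.129.90:9202/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "search.remote.my-cluster-a.seeds": ["10.98.119.90:9300", "10.98.119.90:9301"]
  }
}'
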
Versions:
Logstash, Kibana: 6.0
Elasticsearch: 6.1.1

_remote/info on my-cluster-a:
curl 10.98.119.90:9201/_remote/info?pretty
{ }

_remote/info on my-cluster-b (this only works once or twice after restarting both clusters):
curl 10.98.129.90:9202/_remote/info?pretty
{
  "my-cluster-b" : {
    "seeds" : [
      "10.98.129.90:9302",
      "10.98.129.90:9303"
    ],
    "http_addresses" : [
      "10.98.129.90:9202",
      "10.98.129.90:9203"
    ],
    "connected" : true,
    "num_nodes_connected" : 2,
    "max_connections_per_cluster" : 3,
    "initial_connect_timeout" : "30s",
    "skip_unavailable" : false
  },
  "my-cluster-a" : {
    "seeds" : [
      "10.98.119.90:9300",
      "10.98.119.90:9301"
    ],
    "http_addresses" : [
      "10.98.119.90:9200",
      "10.98.119.90:9201"
    ],
    "connected" : true,
    "num_nodes_connected" : 2,
    "max_connections_per_cluster" : 3,
    "initial_connect_timeout" : "30s",
    "skip_unavailable" : false
  }
}
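
With that configuration in place, a cross-cluster query from my-cluster-b looks like the following sketch ("logstash-*" is a hypothetical index pattern):

curl '10.98.129.90:9202/my-cluster-a:logstash-*/_search?pretty' -H 'Content-Type: application/json' -d '
{ "size": 1, "query": { "match_all": {} } }'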

netstat on my-cluster-b when cross-cluster search is not working:

netstat -nap | grep 10.98.119.90
tcp6 0 0 10.98.129.90:57326 10.98.119.90:9301 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:57323 10.98.119.90:9301 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:57324 10.98.119.90:9301 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:57304 10.98.119.90:9301 ESTABLISHED 60433/java
tcp6 0 0 10.98.129.90:57305 10.98.119.90:9301 ESTABLISHED 60433/java
tcp6 0 0 10.98.129.90:57306 10.98.119.90:9301 ESTABLISHED 60433/java
tcp6 0 0 10.98.129.90:38929 10.98.119.90:9300 ESTABLISHED 60433/java
tcp6 0 0 10.98.129.90:38930 10.98.119.90:9300 ESTABLISHED 60433/java
tcp6 0 84 10.98.129.90:38900 10.98.119.90:9300 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:57307 10.98.119.90:9301 ESTABLISHED 60433/java
tcp6 0 0 10.98.129.90:38903 10.98.119.90:9300 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:38926 10.98.119.90:9300 ESTABLISHED 60433/java
tcp6 0 62 10.98.129.90:57308 10.98.119.90:9301 ESTABLISHED 60433/java
tcp6 0 0 10.98.129.90:38902 10.98.119.90:9300 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:38927 10.98.119.90:9300 ESTABLISHED 60433/java
tcp6 0 0 10.98.129.90:57322 10.98.119.90:9301 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:38904 10.98.119.90:9300 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:38928 10.98.119.90:9300 ESTABLISHED 60433/java
tcp6 0 0 10.98.129.90:57325 10.98.119.90:9301 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:57303 10.98.119.90:9301 ESTABLISHED 60433/java
tcp6 0 0 10.98.129.90:57327 10.98.119.90:9301 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:38899 10.98.119.90:9300 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:38901 10.98.119.90:9300 ESTABLISHED 60593/java
tcp6 0 0 10.98.129.90:38925 10.98.119.90:9300 ESTABLISHED 60433/java
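
Given the symptoms above (connections shown as ESTABLISHED locally while the other side reports node_disconnected_exception), one thing worth checking is whether idle connections are being dropped silently somewhere on the path before the OS keepalive ever fires. A sketch of how to inspect the kernel keepalive settings on both sides:

sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes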

I'm having a problem after upgrading to 6.1.1 too. I'm not sure if it's related to this, but can you check your Elasticsearch log and see whether you're getting nearly continuous GC messages?

My cluster seems OK until I run a search that worked fine in 5.x. In 5.x the search took about 30 seconds and returned a lot of data, but it worked.

Thanks

Yeah, I am getting continuous GC messages:

"[2018-01-08T15:14:18,377][INFO ][o.e.m.j.JvmGcMonitorService] [node-3] [gc][259152] overhead, spent [306ms] collecting in the last [1s]"

When cross-cluster search stops working, I continuously get:

"[2018-01-08T11:59:47,436][WARN ][o.e.t.n.Netty4Transport ] [node-3] exception caught on transport layer [org.elasticsearch.transport.netty4.NettyTcpChannel@1d70087], closing connection
java.io.IOException: Expiró el tiempo de conexión" (Spanish locale message for "Connection timed out")
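
Side note: a connection timeout on an otherwise idle remote connection suggests the link is being dropped silently. One possible mitigation, offered purely as an assumption and not a confirmed fix, is enabling transport-level keepalive pings so the remote-cluster connections never sit idle (set on every node and restart; the config path below assumes a package install, and transport.ping_schedule should be verified against the 6.1 docs):

# append to elasticsearch.yml on each node, then restart the node
echo 'transport.ping_schedule: 5s' | sudo tee -a /etc/elasticsearch/elasticsearch.yml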

Sounds like a garbage-collection problem in 6.1.1; it just depends on what you are doing as to what breaks.

Based on this comment, these logs are normal. Moreover, I am only getting a few GC log entries per hour, for example at 8 am:
Node-3:-----------------------
[2018-01-10T08:04:41,292][INFO ][o.e.m.j.JvmGcMonitorService] [node-3] [gc][405989] overhead, spent [322ms] collecting in the last [1s]
[2018-01-10T08:14:10,063][INFO ][o.e.m.j.JvmGcMonitorService] [node-3] [gc][406557] overhead, spent [354ms] collecting in the last [1s]
[2018-01-10T08:34:07,748][INFO ][o.e.m.j.JvmGcMonitorService] [node-3] [gc][407753] overhead, spent [298ms] collecting in the last [1s]
[2018-01-10T08:44:06,484][INFO ][o.e.m.j.JvmGcMonitorService] [node-3] [gc][408351] overhead, spent [271ms] collecting in the last [1s]
[2018-01-10T08:54:05,269][INFO ][o.e.m.j.JvmGcMonitorService] [node-3] [gc][408949] overhead, spent [251ms] collecting in the last [1s]

Node-4:-----------------------
[2018-01-10T08:50:00,775][INFO ][o.e.m.j.JvmGcMonitorService] [node-4] [gc][409052] overhead, spent [252ms] collecting in the last [1s]
[2018-01-10T08:53:40,877][INFO ][o.e.m.j.JvmGcMonitorService] [node-4] [gc][409272] overhead, spent [288ms] collecting in the last [1s]

Questions:

  1. Do these logs indicate a bad configuration or a cluster problem? (A quick heap check is sketched below.)
  2. What relation do these logs have to the cross-cluster search queries?
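
A minimal way to see whether those GC lines reflect real heap pressure (the node address is reused from the cluster info above):

curl '10.98.129.90:9202/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu'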

OK, when I have the problem, I'm getting those every second. That's what I meant by "continuous".
