I have an Elasticsearch setup at 5 physical sites. Kibana is installed at two of them, and I've configured cross-cluster search to enable searching data at the other locations. Each Kibana instance is separate, with separate configs (DR headspace).
Each site has two ES servers with the gateway role enabled (it's disabled on all other nodes). Cross-cluster sites are configured to hit these two gateway servers on port 9300.
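For illustration, here is a minimal sketch of how such a remote-cluster registration might be built. It assumes the `cluster.remote.<alias>.seeds` setting used by recent Elasticsearch versions (older releases used a `search.remote` prefix instead). The function name and the hostnames are hypothetical, not taken from my actual config.

```python
import json


def remote_cluster_settings(alias, gateway_hosts, port=9300):
    """Build a cluster-settings body registering one remote cluster.

    `alias` is the name used in index patterns (e.g. "USA");
    `gateway_hosts` are the two gateway nodes at that site.
    """
    seeds = [f"{host}:{port}" for host in gateway_hosts]
    return {"persistent": {"cluster": {"remote": {alias: {"seeds": seeds}}}}}


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    body = remote_cluster_settings(
        "USA", ["usa-gw1.example.com", "usa-gw2.example.com"]
    )
    print(json.dumps(body, indent=2))
```

The resulting body would be sent with a PUT to `_cluster/settings` on the cluster that Kibana talks to.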
I've set up Kibana with an index pattern for each site (so I can search one site specifically, say "USA:logstash-*") as well as a pattern to search all sites ("*:logstash-*").
Here's the problem:
USUALLY everything is fine. Both Kibana sites can search all the other sites (individually and using the global pattern). But intermittently, users report failures searching certain sites (and consequently the global pattern as well). At first I suspected some kind of network issue (the sites are spread across the globe), but any low-level network test I run against port 9300 always works fine. I haven't been able to spot any pattern in the failures.
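The kind of low-level test described above could be run repeatedly and timestamped, so that results can be lined up against the times users report failures. This is only a sketch using the standard library; the function names, interval, and timeout are assumptions. A successful connect only shows that the port was reachable at that moment.

```python
import socket
import time


def probe(host, port=9300, timeout=5.0):
    """Attempt one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, None
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)


def probe_loop(hosts, port=9300, interval=30.0, rounds=1):
    """Probe every host `rounds` times and return the recorded failures."""
    failures = []
    for i in range(rounds):
        for host in hosts:
            ok, elapsed, error = probe(host, port)
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            print(f"{stamp} {host}:{port} ok={ok} {elapsed:.3f}s {error or ''}")
            if not ok:
                failures.append((stamp, host, error))
        if i < rounds - 1:
            time.sleep(interval)  # wait before the next round
    return failures
```

For example, `probe_loop(["usa-gw1.example.com", "usa-gw2.example.com"], rounds=120)` (hypothetical hostnames) would check both gateways every 30 seconds for about an hour.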
I might sit down and search against a site, and it doesn't return. Wait 20 mins and then it works.
Kibana, when it fails, comes back with a Bad Gateway error.
I've poked around in the logs of the Kibana nodes, the "gateway" nodes, and others, but nothing is jumping out at me. There are also so many nodes in this environment that it's getting pretty hard to even figure out where to look.
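One way to narrow down where to look might be to poll the remote cluster info API (`GET /_remote/info`) on the cluster behind Kibana and record which remotes report as disconnected at the time of a failure. The sketch below assumes the response is keyed by cluster alias with a boolean `connected` field, and that the cluster is reachable over plain HTTP without authentication. The base URL and function names are hypothetical.

```python
import json
import urllib.request


def fetch_remote_info(base_url, timeout=10.0):
    """Fetch the remote cluster info document from one Elasticsearch node."""
    with urllib.request.urlopen(f"{base_url}/_remote/info", timeout=timeout) as resp:
        return json.load(resp)


def disconnected_remotes(remote_info):
    """Return the sorted aliases of remotes not reporting connected=true."""
    return sorted(
        alias
        for alias, info in remote_info.items()
        if not info.get("connected", False)
    )
```

Calling `disconnected_remotes(fetch_remote_info("http://localhost:9200"))` from a cron job or a loop, and logging the result with a timestamp, would show whether the failures coincide with a remote dropping its connection.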
There IS an nginx instance in front of my Kibana nodes. Each site with Kibana has two Kibana servers, and nginx load-balances between them (just for HA purposes).
Everything I find when searching about CCS is usually along the lines of "it doesn't work at all". Here it does work, most of the time. I'm at a loss to explain why it's not working ALL the time.
Does anyone have any suggestions on what to look for to help track down what might be causing this?
Chris