Cross-Cluster Search - Intermittent Failures

I have an Elasticsearch setup at 5 physical sites. Kibana is installed at two of them, and I've configured cross-cluster search to enable searching data at the other locations. Each Kibana instance is standalone, with its own config (DR headspace).

Each site has two ES servers with the gateway role enabled (it's disabled on all other nodes). The cross-cluster remotes are configured to use these two gateway servers as seeds on port 9300.

I've set up Kibana to have an index pattern for each site (e.g. so I can specifically search, say, "USA:logstash-*") as well as a pattern to search all sites ("*:logstash-*").

Here's the problem:

USUALLY everything is fine. Both Kibana sites can search all the other sites (individually and using the global pattern). But intermittently, users will report failures searching certain sites (and consequently, the global pattern as well). At first I suspected some kind of network issue (the sites are spread across the globe), but any low-level network tests I do against port 9300 always work fine. I haven't been able to see any sort of pattern.

I might sit down and search against a site, and it doesn't return. Wait 20 mins and then it works.

Kibana, when it fails, comes back with a Bad Gateway error.

I've poked around in the logs of the Kibana nodes, the "Gateway" nodes, and others but I'm not seeing anything that's jumping out at me. Also there are so many nodes in this environment it's getting pretty hard to even figure out where to look :stuck_out_tongue:

There IS an nginx instance in front of my Kibana nodes. Each site with Kibana has two servers, nginx load balances between them (just for HA purposes).

Everything I find when searching about CCS is usually along the lines of "it doesn't work at all". Here it does work, most of the time. I'm just at a loss to explain why it's not working ALL the time.

Does anyone have any suggestions on what to look for to help track down what might be causing this?

Chris

I don't think Elasticsearch ever emits a 502 Bad Gateway response (it never acts as a gateway in that sense) so this error is coming from something else. It would be useful to know how Elasticsearch itself is responding (or failing to respond) to cause that error, before being helpfully translated by the "something else" into a 502.

If I had to guess, the first thing I'd investigate is whether any of the long-running inter-cluster connections are being dropped by an intermediary. These connections can sometimes be idle for quite some time, and some networks are overly enthusiastic about terminating idle connections. The simplest way to tell (in ≥ 6.7.0) is to configure a per-remote transport.ping_schedule to prevent the connections from being idle for too long. If that helps, then dropped idle connections were likely your issue.
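For instance, something along these lines through the cluster settings API should do it; this is just a sketch, with my-remote, the credentials, and the 30s value as placeholders to adapt to your setup:

curl -XPUT -k -u (user):(password) -H 'Content-Type: application/json' 'https://(server):9200/_cluster/settings' -d'
{
  "persistent": {
    "cluster.remote.my-remote.transport.ping_schedule": "30s"
  }
}
'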

Certainly worth trying a lower ping schedule. I seem to remember setting this to 30s, but my memory could be failing me.

I'm not entirely sure it's a 502 response. It might be. It's not like what I see on my screen is the classic blank page with a 502 error; it shows up in Kibana at the top as a pink message that I can dismiss with More Info or Ok. Could be the same thing :slight_smile:

I do find it ...irritating? that when I can't do a cross-cluster search, I can go into the cross-cluster status area and see everything connected. Then go right back to the search and it's still not working.

Sure, that does sound irritating :slight_smile:

Unfortunately a stack trace from the JavaScript doesn't really enlighten us, but the phrase Bad Gateway is definitely not an Elasticsearch error message.

I'm not completely sure what logs to look for yet because the problem is hidden behind Kibana and nginx. Can you reproduce this more directly (e.g. use curl to run the same search against a specific Elasticsearch node)? Include the ?error_trace query parameter so we get all the gory details in the response too.
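For example, something along these lines (everything in parentheses is a placeholder, and the query itself doesn't matter much):

curl -XGET -k -u (user):(password) -H 'Content-Type: application/json' 'https://(server):9200/(clustername):logstash-*/_search?error_trace=true' -d'
{ "query": { "match_all": {} } }
'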

Absolutely, let me go try to do it with curl and see if I can reproduce it there. Thanks!

Ok, got some more. I'm going to describe in detail what I did as I'm now slightly more confused.

As a reminder, I have 5 sites; at each site there are two nodes configured to handle CCS operations. At sites without Kibana, these two nodes are dedicated to this function. At sites with Kibana, this role lives ON the Kibana servers.

For my tests, I threw curl queries directly against the ES install on the Kibana nodes, AND against one of the site's data nodes (to compare the differences).

In general, here was my query:

curl -XGET -k -s -u (user):(password) -H 'Content-Type: application/json' 'https://(server):9200/(clustername):logstash-*/_search' -d'
{
  "query": {
    "range": {
      "@timestamp": {
        "gt": "now-15m"
      }
    }
  }
}
'

I did this against all the clusters over the last hour. MOST of the time, all worked as expected. Occasionally, a particular cluster appeared to just never return anything; I'd leave it "hung" for several minutes before giving up and Ctrl-Cing. Wait 5 mins, retry, it worked. One cluster not working did not mean others didn't work. The ones that didn't work were random, and again very intermittent.

I suspect the "no return whatsoever" is causing nginx in front of our Kibana nodes to eventually give up, and that this is what's producing the actual "Bad Gateway" error we're seeing in Kibana...
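If that's the case, the nginx error log on the Kibana hosts should show upstream timeouts at roughly the same times; something like this would be a quick way to check (log path is a guess for our install):

grep -i 'upstream timed out' /var/log/nginx/error.log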

Interestingly enough (although I'm not sure this is related), if I throw the queries at one of the ES data nodes directly I never get any hits (I get proper responses, just with no documents). This seems strange to me but it's not my primary problem so I'll ignore it for now :slight_smile:

So if I'm querying an ES node directly on the Kibana servers, I feel like whatever this is has nothing to do with nginx / Kibana and is in fact something in ES itself.

Last thing to add...I checked the cluster settings for the remote clusters, and they are all basically this:

{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "Canada" : {
          "skip_unavailable" : "true",
          "seeds" : [
            "canweb01:9300",
            "canweb02:9300"
          ],
          "transport" : {
            "ping_schedule" : "15s"
          }
        },

So the ping schedule is set fairly low, and skip_unavailable is even set to true.

:frowning:

Yes, your reasoning seems sound to me.

Yes, it's good to rule those layers out. This is useful information. Now the question is why Elasticsearch is not responding. Is there anything in the Elasticsearch logs (on any node, in any cluster perhaps) that looks unusual at around the time of the hanging query? Similarly, are there any logs between then and when it worked suggesting that something had fixed itself?

When you run a search it will create tasks on each node that's involved. Do you see those tasks being created? Do you see them hanging around for much longer than expected?
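For example, something like this against the coordinating node should list the in-flight search tasks, and running_time_in_nanos in the output gives an idea of how long each one has been hanging around (placeholders as in your earlier curl):

curl -XGET -k -u (user):(password) 'https://(server):9200/_tasks?actions=*search*&detailed=true'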

Unfortunately I don't have the ES logs themselves in ES (for reasons outside my control). And with the 5 sites, gateways, data nodes, etc. we're talking 100+ machines, so...just a lot of monkey work :smiley:

The tasks thing I'm new to. I'll look at that.

Ok, here's something. As I was resigning myself to grep hell, I decided to recheck the ES logs on the Kibana/ES node I was actually querying against (reminder this is just a coordinating-only / "CCS Gateway" node). I have new log errors!

One of these was generated for each curl query I did that failed:

[2020-02-11T15:46:03,626][WARN ][r.suppressed             ] [CANWEB01] path: /Sweden:logstash-*/_search, params: {index=Sweden:logstash-*}
org.elasticsearch.transport.RemoteTransportException: [error while communicating with remote cluster [Sweden]]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [STOESA01][172.16.xx.xx:9300][indices:data/read/search] disconnected

This is the first direct error I've found, and it's pretty direct in pointing at specific servers.

(maybe a lightbulb moment) Another error that doesn't correspond directly to the queries, but shows up randomly on all my gateway nodes, looks like this:

[2020-02-11T02:52:32,873][WARN ][o.e.t.TcpTransport       ] [CANWEB01] exception caught on transport layer [Netty4TcpChannel{localAddress=/172.16.xx.xx:40736, remoteAddress=172.16.xx.xx/172.16.xx.xx:9300}], closing connection
java.io.IOException: Network is unreachable
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
    at sun.nio.ch.IOUtil.read(IOUtil.java:245) ~[?:?]
    at sun.nio.ch.IOUtil.read(IOUtil.java:223) ~[?:?]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:358) ~[?:?]
    at io.netty.buffer.PooledHeapByteBuf.setBytes(PooledHeapByteBuf.java:261) ~[netty-buffer-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1125) ~[netty-buffer-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347) ~[netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.35.Final.jar:4.1.35.Final]
    at java.lang.Thread.run(Thread.java:835) [?:?]

The IPs all belong to the various gateway nodes, so I am now wondering if what you suggested earlier about network devices possibly dropping connections might be a real factor. I would have thought the 15s ping schedule would have addressed that, but....hey, this is starting to feel like something networky :smiley:

I'm sure I can turn up the verbosity of the ES logs; maybe doing this on my gateway node and causing one of the query errors might show more detail...
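Something like this is what I have in mind, bumping the transport logger dynamically so no restart is needed (and setting it back to null afterwards):

curl -XPUT -k -u (user):(password) -H 'Content-Type: application/json' 'https://(server):9200/_cluster/settings' -d'
{
  "transient": {
    "logger.org.elasticsearch.transport": "DEBUG"
  }
}
'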

Yes, that's a pretty low-level thing: it means that something in libc returned ENETUNREACH. Definitely points towards an infrastructural issue.

Thanks so much for your help David. I'm going to start knocking on the network folks' doors :smiley:

I've been working with one of our network techs trying to get a packet capture on both sides of the CCS links, looking for anything that might show a problem at the network layer.

Nothing yet, except this caught my eye...these Elasticsearch packets.

Curious if these are...normal or not :slight_smile:

Unfortunately Wireshark's Elasticsearch dissector is rather a long way out of date so I'm not too surprised that it is not behaving reasonably here.

Fortunately I don't think you're looking for any problems that need you to dissect this traffic. The Network is unreachable message shared above comes from quite a few layers lower down.

Ok, thanks again. You had suggested that "something in libc returned ENETUNREACH", and that it pointed to an infrastructural issue. I'm trying to figure out what to look for next. My talks with our network folks were vaguely along the lines of "look for anything on the network infra between these devices that might look broken"...which I realize is terribly unspecific. We ended up thinking the TCP connections were getting broken, which is where the packet-capture idea came from.

I'm afraid the inner workings of libc are just not something I'm familiar with. Do you have any suggestions on where I might continue hunting?

I'll be honest, Network is unreachable is not an error I've seen very frequently. I think I've seen it caused by the local network interface being taken down, by something messing up the local routing table, or by an intermediate router doing something wrong. In the first case you probably get log messages from the kernel (visible with dmesg), and in the third case you should see an ICMP Destination Unreachable message being received. No idea how to diagnose the second case apart from hunting down whatever's messing with the routing table.
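If it helps, a rough checklist on the affected hosts might look something like this; just the usual Linux tools, so adjust to your distro:

# any kernel messages about interfaces or links going down?
dmesg -T | grep -iE 'link|carrier|eth'
# is there still a route to the remote subnet?
ip route show
# watch for ICMP Destination Unreachable arriving during a failure
tcpdump -ni any 'icmp[icmptype] = icmp-unreach'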

Hmm, ok. "Messing with the routing table" actually gives me a focus on the systems themselves. They happen to be VMs on an ESXi infra. I'll start poking in there :slight_smile:

Hah, not sure why I never thought to do this, but thinking about your comment on the ICMP Destination Unreachable message, I started a ping from one box to another and just let it go. After about 40-50 pings it just stops getting replies.
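For reference, this is all I'm running, left going in a screen session (stoesa01 is just one of the remote gateways as an example; -D adds timestamps so I can line failures up against the ES logs):

ping -D -i 1 stoesa01 | tee ping-stoesa01.log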

This is a much easier test for us to troubleshoot with :smiley:
