Cross-Cluster Search - Intermittent Failures

I have an Elasticsearch setup at 5 physical sites. Kibana is installed at two of them, and I've configured cross-cluster search to enable searching data at the other locations. Each Kibana instance is standalone, with its own config (DR headspace).

Each site has two ES servers with the gateway role enabled (it's disabled on all other nodes). The cross-cluster remotes are configured to use these two gateway servers as seeds on port 9300.

I've set up Kibana to have an index pattern for each site (e.g. so I can specifically search, say, "USA:logstash-*") as well as a pattern to search all sites ("*:logstash-*").

Here's the problem:

USUALLY everything is fine. Both Kibana sites can search all the other sites (individually and using the global pattern). But intermittently, users will report failures searching certain sites (and consequently, the global pattern as well). At first I suspected some kind of network issue (the sites are spread across the globe), but any low-level network tests I do against port 9300 always work fine. I haven't been able to see any sort of pattern.

I might sit down and search against a site, and it doesn't return. Wait 20 mins and then it works.

Kibana, when it fails, comes back with a Bad Gateway error.

I've poked around in the logs of the Kibana nodes, the "Gateway" nodes, and others but I'm not seeing anything that's jumping out at me. Also there are so many nodes in this environment it's getting pretty hard to even figure out where to look :stuck_out_tongue:

There IS an nginx instance in front of my Kibana nodes. Each site with Kibana has two servers, nginx load balances between them (just for HA purposes).

Everything I find when searching about CCS is usually along the lines of "it doesn't work at all". Here it does work, most of the time. I'm just at a loss to explain why it's not working ALL the time.

Does anyone have any suggestions on what to look for to help track down what might be causing this?

Chris

I don't think Elasticsearch ever emits a 502 Bad Gateway response (it never acts as a gateway in that sense) so this error is coming from something else. It would be useful to know how Elasticsearch itself is responding (or failing to respond) to cause that error, before being helpfully translated by the "something else" into a 502.

If I had to guess, the first thing I'd investigate is whether any of the long-running inter-cluster connections are being dropped by an intermediary. These connections can sometimes be idle for quite some time, and some networks are overly enthusiastic about terminating idle connections. The simplest way to tell (in ≥ 6.7.0) is to configure a per-remote transport.ping_schedule to prevent the connections from being idle for too long. If that helps, then dropped idle connections were likely your issue.
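For instance, something along these lines through the cluster settings API should do it; this is just a sketch, with my-remote, the credentials, and the 30s value as placeholders to adapt to your setup:

curl -XPUT -k -u (user):(password) -H 'Content-Type: application/json' 'https://(server):9200/_cluster/settings' -d'
{
  "persistent": {
    "cluster.remote.my-remote.transport.ping_schedule": "30s"
  }
}
'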

Certainly worth trying a lower ping schedule. I seem to remember setting this to 30s, but my memory could be failing me.

I'm not entirely sure it's a 502 response. It might be. It's not like what I see on my screen is the classic blank page with a 502 error; it shows up in Kibana at the top as a pink message that I can dismiss with More Info or Ok. Could be the same thing :slight_smile:

I do find it ...irritating? that when I can't do a cross-cluster search, I can go into the cross-cluster status area and see everything connected. Then go right back to the search and it's still not working.

Sure, that does sound irritating :slight_smile:

Unfortunately a stack trace from the JavaScript doesn't really enlighten us, but the phrase Bad Gateway is definitely not an Elasticsearch error message.

I'm not completely sure what logs to look for yet because the problem is hidden behind Kibana and nginx. Can you reproduce this more directly (e.g. use curl to run the same search against a specific Elasticsearch node)? Include the ?error_trace query parameter so we get all the gory details in the response too.
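For example, something along these lines (everything in parentheses is a placeholder, and the query itself doesn't matter much):

curl -XGET -k -u (user):(password) -H 'Content-Type: application/json' 'https://(server):9200/(clustername):logstash-*/_search?error_trace=true' -d'
{ "query": { "match_all": {} } }
'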

Absolutely, let me go try to do it with curl and see if I can reproduce it there. Thanks!

Ok, got some more. I'm going to describe in detail what I did as I'm now slightly more confused.

As a reminder, I have 5 sites; at each site there are two nodes configured to handle CCS operations. At sites without Kibana, these two nodes are dedicated to this function. At sites with Kibana, this role lives ON the Kibana servers.

For my tests, I threw curl queries directly against the ES install on the Kibana nodes, AND against one of the site's data nodes (to compare the differences).

In general, here was my query:

curl -XGET -k -s -u (user):(password) -H 'Content-Type: application/json' 'https://(server):9200/(clustername):logstash-*/_search' -d'
{
  "query": {
    "range": {
      "@timestamp": {
        "gt": "now-15m"
      }
    }
  }
}
'

I did this against all the clusters over the last hour. MOST of the time, all worked as expected. Occasionally, a particular cluster appeared to just never return anything; I'd leave it "hung" for several minutes before giving up and Ctrl-Cing. Wait 5 mins, retry, it worked. One cluster not working did not mean others didn't work. The ones that didn't work were random, and again very intermittent.

I suspect the "no return whatsoever" is causing nginx in front of our Kibana nodes to eventually give up, and that this is what's producing the actual "Bad Gateway" error we're seeing in Kibana...
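If that's the case, the nginx error log on the Kibana hosts should show upstream timeouts at roughly the same times; something like this would be a quick way to check (log path is a guess for our install):

grep -i 'upstream timed out' /var/log/nginx/error.log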

Interestingly enough (although I'm not sure this is related), if I throw the queries at one of the ES data nodes directly I never get any hits (I get proper responses, just with no documents). This seems strange to me but it's not my primary problem so I'll ignore it for now :slight_smile:

So if I'm querying an ES node directly on the Kibana servers, I feel like whatever this is has nothing to do with nginx / Kibana and is in fact something in ES itself.

Last thing to add...I checked the cluster settings for the remote clusters, and they are all basically this:

{
  "persistent" : {
    "cluster" : {
      "remote" : {
        "Canada" : {
          "skip_unavailable" : "true",
          "seeds" : [
            "canweb01:9300",
            "canweb02:9300"
          ],
          "transport" : {
            "ping_schedule" : "15s"
          }
        },

So the ping schedule is set fairly low, and skip_unavailable is even set to true.

:frowning:

Yes, your reasoning seems sound to me.

Yes, it's good to rule those layers out. This is useful information. Now the question is why Elasticsearch is not responding. Is there anything in the Elasticsearch logs (on any node, in any cluster perhaps) that looks unusual at around the time of the hanging query? Similarly, are there any logs between then and when it worked suggesting that something had fixed itself?

When you run a search it will create tasks on each node that's involved. Do you see those tasks being created? Do you see them hanging around for much longer than expected?
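For example, something like this against the coordinating node should list the in-flight search tasks, and running_time_in_nanos in the output gives an idea of how long each one has been hanging around (placeholders as in your earlier curl):

curl -XGET -k -u (user):(password) 'https://(server):9200/_tasks?actions=*search*&detailed=true'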

Unfortunately I don't have the ES logs themselves in ES (for reasons outside my control). And with the 5 sites, gateways, data nodes, etc. we're talking 100+ machines, so...just a lot of monkey work :smiley:

The tasks thing I'm new to. I'll look at that.

Ok, here's something. As I was resigning myself to grep hell, I decided to recheck the ES logs on the Kibana/ES node I was actually querying against (reminder this is just a coordinating-only / "CCS Gateway" node). I have new log errors!

One of these was generated for each curl query I did that failed:

[2020-02-11T15:46:03,626][WARN ][r.suppressed             ] [CANWEB01] path: /Sweden:logstash-*/_search, params: {index=Sweden:logstash-*}
org.elasticsearch.transport.RemoteTransportException: [error while communicating with remote cluster [Sweden]]
Caused by: org.elasticsearch.transport.NodeDisconnectedException: [STOESA01][172.16.xx.xx:9300][indices:data/read/search] disconnected

This is the first direct error I've found, and it's pretty direct in pointing at specific servers.

(maybe a lightbulb moment) Another error that doesn't correspond directly to the queries, but shows up randomly on all my gateway nodes, looks like this:

[2020-02-11T02:52:32,873][WARN ][o.e.t.TcpTransport       ] [CANWEB01] exception caught on transport layer [Netty4TcpChannel{localAddress=/172.16.xx.xx:40736, remoteAddress=172.16.xx.xx/172.16.xx.xx:9300}], closing connection
java.io.IOException: Network is unreachable
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[?:?]
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[?:?]
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:276) ~[?:?]
    at sun.nio.ch.IOUtil.read(IOUtil.java:245) ~[?:?]
    at sun.nio.ch.IOUtil.read(IOUtil.java:223) ~[?:?]
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:358) ~[?:?]
    at io.netty.buffer.PooledHeapByteBuf.setBytes(PooledHeapByteBuf.java:261) ~[netty-buffer-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1125) ~[netty-buffer-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347) ~[netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.35.Final.jar:4.1.35.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.35.Final.jar:4.1.35.Final]
    at java.lang.Thread.run(Thread.java:835) [?:?]

The IPs all belong to the various gateway nodes, so I am now wondering if what you suggested earlier about network devices possibly dropping connections might be a real factor. I would have thought the 15s ping schedule would have addressed that, but....hey, this is starting to feel like something networky :smiley:

I'm sure I can turn up the verbosity of the ES logs; maybe doing this on my gateway node and causing one of the query errors might show more detail...
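Something like this is what I have in mind, bumping the transport logger dynamically so no restart is needed (and setting it back to null afterwards):

curl -XPUT -k -u (user):(password) -H 'Content-Type: application/json' 'https://(server):9200/_cluster/settings' -d'
{
  "transient": {
    "logger.org.elasticsearch.transport": "DEBUG"
  }
}
'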

Yes, that's a pretty low-level thing: it means that something in libc returned ENETUNREACH. Definitely points towards an infrastructural issue.

Thanks so much for your help David. I'm going to start knocking on the network folks' doors :smiley:

I've been working with one of our network techs trying to get a packet capture on both sides of the CCS links, looking for anything that might show a problem at the network layer.

Nothing yet, except this caught my eye...these Elasticsearch packets.

Curious if these are...normal or not :slight_smile:

Unfortunately Wireshark's Elasticsearch dissector is rather a long way out of date so I'm not too surprised that it is not behaving reasonably here.

Fortunately I don't think you're looking for any problems that need you to dissect this traffic. The Network is unreachable message shared above comes from quite a few layers lower down.

Ok, thanks again. You had suggested that "something in libc returned ENETUNREACH", and that it pointed to an infrastructural issue. I'm trying to figure out what to look for next. My talks with our network folks were vaguely along the lines of "look for anything on the network infra between these devices that might look broken"...which I realize is terribly unspecific. We ended up thinking the TCP connections were getting broken, which is where the packet-capture idea came from.

I'm afraid the inner workings of libc are just not something I'm familiar with. Do you have any suggestions on where I might continue hunting?

I'll be honest, Network is unreachable is not an error I've seen very frequently. I think I've seen it caused by the local network interface being taken down, by something messing up the local routing table, or by an intermediate router doing something wrong. In the first case you probably get log messages from the kernel (visible with dmesg), and in the third case you should see an ICMP Destination Unreachable message being received. No idea how to diagnose the second case apart from hunting down whatever's messing with the routing table.
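If it helps, a rough checklist on the affected hosts might look something like this; just the usual Linux tools, so adjust to your distro:

# any kernel messages about interfaces or links going down?
dmesg -T | grep -iE 'link|carrier|eth'
# is there still a route to the remote subnet?
ip route show
# watch for ICMP Destination Unreachable arriving during a failure
tcpdump -ni any 'icmp[icmptype] = icmp-unreach'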

Hmm, ok. "Messing with the routing table" actually gives me a focus on the systems themselves. They happen to be VMs on an ESXi infra. I'll start poking in there :slight_smile:

Hah, not sure why I never thought to do this, but thinking about your comment on the ICMP Destination Unreachable message, I started a ping from one box to another and just let it go. After about 40-50 pings it just stops getting replies.
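For reference, this is all I'm running, left going in a screen session (stoesa01 is just one of the remote gateways as an example; -D adds timestamps so I can line failures up against the ES logs):

ping -D -i 1 stoesa01 | tee ping-stoesa01.log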

This is a much easier test for us to troubleshoot with :smiley:
