Timeout Exceptions (NoNodeAvailableException)?

Hi,
We are having a problem with timeouts and I was hoping someone could help. Thanks a lot in advance.

Our cluster. ES 2.3.5, 10 machines (3 master nodes, 7 data nodes), most of them running Ubuntu 14.04. The settings are mostly the ones found in the deb package (exact settings available upon request), and the plugins are [head, cloud-aws, repository-hdfs]. We're using nginx to handle https and ufw as a firewall (all of the machines involved are whitelisted for the relevant ports). The machines run on Hetzner, which provides self-managed servers in Germany. Each machine has a fixed IP address, but we refer to them through a DNS record that points only to the data nodes, managed via AWS Route 53. We have a large number of shards (~4.8k) spread across a large number of indices (~2.4k). We use Sematext monitoring, and everything appears healthy.

Our problem. We are getting an intermittent NoNodeAvailableException on our first request, after failing to verify nodes when we call addTransportAddress(...) for the cluster. When we bump the logging up to trace, we see timeout exceptions for the channel connections. If we increase the timeout by 2x, we still see timeout exceptions. If we increase the timeout by 10x, we see disconnect exceptions, and well before a 10x timeout has elapsed. I can provide all of the stack traces upon request. We've turned sniffing on; it appears that if we fail on one machine we fail on all of them, but given the time required to sniff a 10-node cluster, we did not explore this thoroughly.
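For concreteness, here is roughly how we bumped the timeout on the client. This is a sketch: I'm assuming client.transport.ping_timeout (default 5s) is the setting that governs these verification timeouts, and the values shown are illustrative rather than the exact multiples we tried.

Settings settings = Settings.builder()
        .put("client.transport.ignore_cluster_name", true)
        .put("client.transport.ping_timeout", "10s")              // 2x the 5s default; we also tried larger values
        .put("client.transport.nodes_sampler_interval", "10s")    // how often the client re-samples the node list
        .build();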

Our explored causes. We have ruled out the most common causes. We are using the same JVM and the same version of ES, the known firewalls appear to be working as normal, and the file descriptors are well under the limit. We've tried restarting the machines; no change in behavior. We've tried making requests to only one machine consistently; no change in behavior. We've done a forcemerge and reduced the number of shards by about 2%, but did not notice an improvement. We cannot currently lower the number of shards any further.

Our current workarounds. If we put a long wait (e.g. 120 seconds) between the addTransportAddress(...) call and the first request, we are significantly more likely to see a successful connection. If we reduce the number of channels used to only the essential 5, we always connect successfully. If we reduce to an intermediate number of channels, we see an improvement, but not consistent success.
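For reference, here is a sketch of the reduced-channel configuration. I'm assuming the transport.connections_per_node.* settings (defaults: recovery=2, bulk=3, reg=6, state=1, ping=1, i.e. 13 channels per node) are what control the channel count; the exact values we used are available upon request.

Settings settings = Settings.builder()
        .put("client.transport.ignore_cluster_name", true)
        .put("transport.connections_per_node.recovery", 1)  // one channel per connection type
        .put("transport.connections_per_node.bulk", 1)
        .put("transport.connections_per_node.reg", 1)
        .put("transport.connections_per_node.state", 1)
        .put("transport.connections_per_node.ping", 1)
        .build();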

I would love any advice I can get. Thanks again.

Hey,

did you try to specify IP addresses instead of the DNS name, to rule out DNS issues as a first step?

Also, I am a bit confused by your description. You are talking about your server setup, but then mention addTransportAddress, so the issue is not within your cluster itself, but between a TransportClient and the cluster? Can you elaborate on that?

--Alex

We provided our cluster specs for background information. Yes, the issue occurs when connecting with a TransportClient. We are able to reproduce it by creating a TransportClient, adding any address in our cluster, and then trying to run any command (e.g. search, get, or cluster stats). We have tried specific IP addresses; that does not resolve the issue.

Hey

can you share logs from the transport client at trace level, so we can see the timeout? Plus the code snippet with the configuration.
Also does the same happen if you disable sniffing?

--Alex

I see the same behavior with or without sniffing.

Here's a code snippet:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.concurrent.ExecutionException;

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// NODE_ADDRESS is the hostname (or IP) of one of our data nodes
public static void main(String[] args) throws ExecutionException, InterruptedException, IOException {
    Settings settings = Settings.builder()
            .put("client.transport.ignore_cluster_name", true)
            .put("client.transport.sniff", false)
            .build();
    try (TransportClient transportClient = TransportClient.builder().settings(settings).build()) {
        transportClient.addTransportAddress(new InetSocketTransportAddress(new InetSocketAddress(NODE_ADDRESS, 9300)));
        transportClient.admin().cluster().prepareHealth().get();
    }
}

And here's a link to a gist with the logs: failure-example-log-1

Hey,

this very much looks like a networking/firewall issue. As you seem to run this on your desktop (judging from the IntelliJ part in your stack trace), can you run

telnet 136.243.68.7 9300

from that system and see what happens? Does that work? If you do not get a "Connected to x.y.z" message, then there is some component dropping connections between you and the server.

Why is the log format so different? I assume it is just using your application log format, right?

--Alex

That does work; I am able to connect. Yes, that's our application log format. And yes, I ran this example code from my desktop, but I've seen the same behavior on our remote servers.

Hey,

a) what does /usr/bin/curl -v IP_ADDR:9300 return? (no, not a typo)
b) can you use netstat and check for open connections when you try to connect? In what state are those?

--Alex

For (a)

$ /usr/bin/curl -v 136.243.68.7:9300
* About to connect() to 136.243.68.7 port 9300 (#0)
*   Trying 136.243.68.7... connected
> GET / HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Host: 136.243.68.7:9300
> Accept: */*
> 
* Connection #0 to host 136.243.68.7 left intact
* Closing connection #0
This is not a HTTP port

For (b): when I run netstat on the target node while trying to connect, I see that a number of connections are ESTABLISHED on both machines almost instantly, with a few more on my local machine in the SYN_SENT state. A few of those then become ESTABLISHED before the timeout exceptions appear and the remaining connections go into the TIME_WAIT state.

Hey,

did you see anything in the logs on the server side? You could also increase logging there for a test. The logs on the client side don't give any particular hint at first sight, as you said.

Another test might be to take NIO out of the equation by setting transport.netty.transport.tcp.blocking_client: true in the client settings.
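An untested sketch of what I mean, dropped into the snippet you posted above:

Settings settings = Settings.builder()
        .put("client.transport.ignore_cluster_name", true)
        .put("client.transport.sniff", false)
        .put("transport.netty.transport.tcp.blocking_client", true)  // use the blocking client instead of NIO
        .build();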

Also, is it possible that there is a maximum number of parallel connections per src/dst host pair enforced by a firewall in between? You could try creating a bunch of open (HTTP) connections to Elasticsearch and see if that works.
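Something along these lines would do as a rough test (a sketch; the host and connection count are placeholders): it opens a number of plain TCP connections to the HTTP port and holds them, so you can watch whether anything in between caps or drops them.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

public class ConnectionCapTest {
    public static void main(String[] args) throws IOException, InterruptedException {
        String host = "136.243.68.7"; // one of your data nodes
        int count = 20;               // more connections than the transport client would open
        List<Socket> sockets = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            Socket socket = new Socket();
            socket.connect(new InetSocketAddress(host, 9200), 5000); // HTTP port, 5s connect timeout
            sockets.add(socket);
            System.out.println("connection " + i + " established");
        }
        Thread.sleep(30_000); // hold the connections open and watch netstat on both sides
        for (Socket socket : sockets) {
            socket.close();
        }
    }
}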

--Alex

I just tried setting transport.netty.transport.tcp.blocking_client to true, and I am no longer seeing errors. Can you explain what that did and what sort of underlying problem that might indicate? Also, thank you so much!
