We are having a problem with timeouts and I was hoping someone could help. Thanks a lot in advance.
Our cluster. ES 2.3.5, 10 machines (3 master nodes, 7 data nodes), most of them running Ubuntu 14.04. Settings are mostly the defaults from the deb package (exact settings available upon request); plugins are [head, cloud-aws, repository-hdfs]. We're using nginx to handle https and ufw as a firewall (everyone involved has been whitelisted for the correct ports). The machines are hosted on Hetzner, which provides self-managed servers in Germany. Every machine has a fixed IP address, but we refer to them via a DNS name (managed in AWS Route 53) that resolves only to the data nodes. We have a large number of shards (~4.8k) across a large number of indices (~2.4k). We use Sematext monitoring, and everything appears healthy.
Our problem. We are getting an intermittent NoNodeAvailableException on our first request, after the client fails to verify nodes when we addTransportAddress(...) for the cluster. When we bump the logging up to trace, we see timeout exceptions on the channel connections. If we increase the timeout 2x, we still see timeout exceptions. If we increase the timeout 10x, we see disconnect exceptions instead, and well before the 10x timeout elapses. I can provide all of the stack traces upon request. We've turned sniffing on; it appears that if we fail on one machine we fail on all of them, but given the time required to sniff a 10-node cluster, we did not explore this thoroughly.
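For context, the knobs involved are the standard ES 2.x transport client settings. This is a sketch of the relevant configuration, not our exact config; the cluster name is a placeholder and the timeout values shown are the 2.x defaults (the ones we multiplied 2x and 10x):

```yaml
# ES 2.x TransportClient settings (values shown are defaults / illustrative):
cluster.name: our-cluster                    # placeholder, not our real cluster name
client.transport.sniff: true                 # the sniffing mentioned above
client.transport.ping_timeout: 5s            # the timeout we increased 2x and 10x
client.transport.nodes_sampler_interval: 5s  # how often connected nodes are re-sampled
```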
Our explored causes. We have ruled out the most common causes: we are using the same JVM everywhere, the same version of ES, the known firewalls appear to be working as normal, and file descriptors are well under the limit. We've tried restarting the machines; no change in behavior. We've tried sending requests to only one machine consistently; no change in behavior. We've done a forcemerge, and we've reduced the number of shards by about 2%, with no noticeable improvement. We cannot currently lower the number of shards any further.
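For completeness, the file-descriptor check we ran is along these lines (a minimal sketch; the 64000 figure is the commonly recommended minimum for ES nodes, not our actual limit):

```shell
#!/bin/sh
# Print the soft limit on open files for the current shell session;
# the Elasticsearch docs recommend at least 64000 on ES nodes.
ulimit -n

# The live per-node count can also be pulled from the cluster itself, e.g.:
#   curl -s 'localhost:9200/_nodes/stats/process?pretty' | grep open_file_descriptors
```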
Our current workarounds. If we put a long wait (e.g. 120 seconds) between addTransportAddress(...) and the first request, we are significantly more likely to get a successful connection. If we reduce the number of channels to only the essential 5, we always succeed. If we reduce to an intermediate number of channels, we see an improvement, but not consistent success.
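For anyone who wants to reproduce the channel-count workaround: the per-node channel counts come from the transport.connections_per_node.* settings. By default, ES 2.x opens 13 channels per node (2 + 3 + 6 + 1 + 1); one channel per connection type gives the 5 channels mentioned above. The exact values we used are an assumption here; this sketch just shows the shape:

```yaml
# Reduce from the 2.x default of 13 channels per node to 5 (one per type).
# Values are illustrative, not necessarily what we ran with:
transport.connections_per_node.recovery: 1   # default 2
transport.connections_per_node.bulk: 1       # default 3
transport.connections_per_node.reg: 1        # default 6
transport.connections_per_node.state: 1      # default 1
transport.connections_per_node.ping: 1       # default 1
```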
I would love any advice I can get. Thanks again.