Hello,
We have a somewhat unusual problem, so I will try to explain everything in detail.
Setup
The above diagram shows our current setup (simplified).
Our app runs in a Kubernetes / OpenShift cluster. The Deployment is scaled, so there are multiple Pods of each app.
From the K8s cluster, requests go to a firewall (FW1), which is connected via an IPsec tunnel to another datacenter. FW1 does not block outgoing traffic.
The firewall in DC 2 (FW2) allows traffic on ports 443 and 9200. It also runs an HAProxy instance, which handles TLS termination and load balancing for the 3 Elastic nodes.
The traffic from HAProxy to the Elastic nodes is default HTTPS traffic on port 9200.
Problem
Occasionally we get the following error in some apps:
Elasticsearch\Common\Exceptions\NoNodesAvailableException
No alive nodes found in your cluster
We handle millions of requests per hour, but only 1-2% throw this error. When it happens, it occurs in only 1 of maybe 5 Pods of the same application. Sometimes multiple apps have this problem at the same time, but other times only one of them has it.
Tried fixes
We have tried increasing the connection timeout and the read timeout.
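For illustration, a timeout change of this kind could look roughly like the sketch below. It assumes the per-request `client` options (`timeout`, `connect_timeout`) of elasticsearch-php 7.x. The host, index name, and values are placeholders, not the real configuration.

```php
<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

// Placeholder endpoint (assumption); replace with the real HAProxy address.
$client = ClientBuilder::create()
    ->setHosts(['https://haproxy.example.internal:9200'])
    ->build();

$params = [
    'index'  => 'my_index', // hypothetical index name
    'body'   => ['query' => ['match_all' => new \stdClass()]],
    'client' => [
        'timeout'         => 10, // total request timeout in seconds
        'connect_timeout' => 5,  // connection timeout in seconds
    ],
];

$response = $client->search($params);
```

These options apply only to the request they are passed with, so every call that should use the longer timeouts needs them.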
We searched through the Elasticsearch logs and also tried to reproduce the issue manually.
We also changed the connection pool from StaticNoPingConnectionPool to the normal StaticConnectionPool:
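use Elasticsearch\ClientBuilder;

// StaticConnectionPool pings a node before using it; the NoPing variant does not.
$client = ClientBuilder::create()
    ->setConnectionPool('\Elasticsearch\ConnectionPool\StaticConnectionPool', [])
    ->build();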
Nothing worked.
After we configured our applications to connect directly to one of the nodes, bypassing the load balancer, the errors stopped.
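As a rough sketch of the difference between the two configurations, only the host passed to the client changes. Both hostnames below are hypothetical placeholders, and the snippet assumes elasticsearch-php 7.x installed via Composer.

```php
<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

// Via the load balancer (errors occur occasionally); hypothetical hostname.
$viaProxy = ClientBuilder::create()
    ->setHosts(['https://haproxy.example.internal:9200'])
    ->build();

// Directly to a single Elastic node (no errors so far); hypothetical hostname.
$direct = ClientBuilder::create()
    ->setHosts(['https://es-node-1.example.internal:9200'])
    ->build();
```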
Over the last few days we have searched the Internet (Elastic Discourse, GitHub repos, the HAProxy forum, Reddit, and half of Google) for a solution.
We are currently completely out of ideas.
Thanks in advance for any help.