Occasional NoAliveNodesFound with HAProxy as load balancer

Hello,

We have a somewhat unusual problem, so I will try to explain everything in detail.

Setup

The above diagram shows our current setup (simplified).

Our app runs within a Kubernetes / OpenShift cluster. The Deployment is scaled so that we have multiple Pods of one app.
From the K8s cluster, requests go to a firewall (FW1), which is connected to another datacenter via an IPsec tunnel (outgoing traffic is not blocked).

The firewall in DC 2 (FW2) allows traffic on ports 443 and 9200. It also runs an HAProxy instance which handles TLS termination and load balancing for the 3 Elastic nodes.

The traffic from HAProxy to the Elastic nodes is standard HTTPS traffic on port 9200.

Problem

Occasionally we get the following error in some apps:

Elasticsearch\Common\Exceptions\NoNodesAvailableException

No alive nodes found in your cluster

We handle millions of requests per hour, but only 1-2% of them throw this error. When it happens, it only occurs in 1 of maybe 5 Pods of the same application. Sometimes multiple apps have this problem at the same time, but other times only one of them has it.

Tried fixes

We have tried increasing the connection timeout and the read timeout.
We searched through the Elasticsearch logs and also tried to reproduce the issue manually.
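
For readers who want to try the same thing, here is a minimal sketch of how timeouts can be raised per request in elasticsearch-php 7.x via the `client` options. The host name, index name and timeout values are hypothetical placeholders, not the ones from our setup:

```php
<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

// Hypothetical load balancer endpoint; replace with your own.
$client = ClientBuilder::create()
    ->setHosts(['https://haproxy.example.internal:443'])
    ->build();

$params = [
    'index' => 'my-index', // hypothetical index name
    'body'  => ['query' => ['match_all' => new \stdClass()]],
    'client' => [
        'timeout'         => 10, // seconds allowed for the whole request (assumed value)
        'connect_timeout' => 5,  // seconds allowed to establish the connection (assumed value)
    ],
];

$response = $client->search($params);
```

In our case, raising these values alone did not make the errors go away.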

We also changed the connection pool from StaticNoPingConnectionPool to the normal StaticConnectionPool:

$client = ClientBuilder::create()
    ->setConnectionPool('\Elasticsearch\ConnectionPool\StaticConnectionPool', [])
    ->build();

Nothing worked.

After we configured our applications to connect directly to one of the nodes, bypassing the load balancer, the errors stopped.
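
A rough sketch of such a direct connection, assuming a hypothetical node host name (`es-node-1.example.internal` is a placeholder, not one of our real nodes):

```php
<?php
require 'vendor/autoload.php';

use Elasticsearch\ClientBuilder;

// Talk to a single Elasticsearch node on port 9200 instead of the HAProxy endpoint.
// The host name below is a hypothetical placeholder.
$client = ClientBuilder::create()
    ->setHosts(['https://es-node-1.example.internal:9200'])
    ->build();
```

This is useful as a diagnostic step only, since it removes load balancing and failover across the 3 nodes.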


We have spent the last few days searching the Internet (Elastic Discourse, GitHub repos, HAProxy forum, Reddit and half of Google) for a solution.

We are currently completely out of ideas.

Thanks in advance for any help.

For everyone who comes across this post and has the same problem:

The solution was to change the connection pool.

Thanks for sharing that solution!
