Elasticsearch on Kubernetes losing Cluster Connection every hour


I am migrating an older application that uses Elasticsearch 2.4 to Kubernetes (I know that's a very old version, but I'm not able to upgrade at this point). I set it up using a StatefulSet and a headless service, which basically works fine.
However, the connection between cluster nodes is lost after exactly one hour, reestablished, and then lost after an hour again.

There are no errors in the logs, just the messages from nodes joining and leaving the cluster:

[2020-06-15 04:11:21,652][INFO ][discovery.zen ] [Midas] master_left [{Isis}{2UKej6PQRya4WnEARlEZwA}{}{}], reason [transport disconnected]
[2020-06-15 04:11:21,652][WARN ][discovery.zen ] [Midas] master left (reason = transport disconnected), current nodes: {{Stunner}{QNE6jkMyR8KqVmRLwbIWhA}{}{},{Midas}{66GU3k9BRqGQ2PRAAxnfmQ}{}{},}
[2020-06-15 04:11:21,652][INFO ][cluster.service ] [Midas] removed {{Isis}{2UKej6PQRya4WnEARlEZwA}{}{},}, reason: zen-disco-master_failed ({Isis}{2UKej6PQRya4WnEARlEZwA}{}{})
[2020-06-15 04:11:51,676][INFO ][cluster.service ] [Midas] detected_master {Isis}{2UKej6PQRya4WnEARlEZwA}{}{}, added {{Isis}{2UKej6PQRya4WnEARlEZwA}{}{},}, reason: zen-disco-receive(from master [{Isis}{2UKej6PQRya4WnEARlEZwA}{}{}])
[2020-06-15 04:13:20,776][INFO ][cluster.service ] [Midas] removed {{Stunner}{QNE6jkMyR8KqVmRLwbIWhA}{}{},}, reason: zen-disco-receive(from master [{Isis}{2UKej6PQRya4WnEARlEZwA}{}{}])
[2020-06-15 04:13:50,794][INFO ][cluster.service ] [Midas] added {{Stunner}{QNE6jkMyR8KqVmRLwbIWhA}{}{},}, reason: zen-disco-receive(from master [{Isis}{2UKej6PQRya4WnEARlEZwA}{}{}])
[2020-06-15 05:11:51,660][INFO ][discovery.zen ] [Midas] master_left [{Isis}{2UKej6PQRya4WnEARlEZwA}{}{}], reason [transport disconnected]
[2020-06-15 05:11:51,660][WARN ][discovery.zen ] [Midas] master left (reason = transport disconnected), current nodes: {{Stunner}{QNE6jkMyR8KqVmRLwbIWhA}{}{},{Midas}{66GU3k9BRqGQ2PRAAxnfmQ}{}{},}

I have found some other threads that suggest reducing the TCP keepalive settings on the nodes. Unfortunately, that didn't help in my case.

Does anyone know what causes this and/or what I can do to fix this?



Turns out Istio was the culprit. I excluded the Elasticsearch transport port (9300) from the Envoy sidecar by setting these pod annotations:

        "traffic.sidecar.istio.io/excludeInboundPorts": "9300"
        "traffic.sidecar.istio.io/excludeOutboundPorts": "9300"

Now the cluster connection is stable.
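
For anyone finding this later: the annotations have to be on the pod template of the StatefulSet, not on the StatefulSet's own metadata, because the Istio sidecar injector reads them from the pod. Below is a rough sketch of where they go. The resource name, labels, replica count, and image tag are placeholders I made up for illustration, not my actual manifest, so adjust them to your setup.

```yaml
# Sketch only: name, labels, replicas, and image are hypothetical placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch              # hypothetical name
spec:
  serviceName: elasticsearch       # hypothetical headless service name
  replicas: 3                      # assumed; matches the three nodes in the logs
  selector:
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
      annotations:
        # Bypass the Envoy sidecar for node-to-node transport traffic on 9300.
        traffic.sidecar.istio.io/excludeInboundPorts: "9300"
        traffic.sidecar.istio.io/excludeOutboundPorts: "9300"
    spec:
      containers:
        - name: elasticsearch
          image: elasticsearch:2.4   # assumed image tag
          ports:
            - containerPort: 9200
              name: http
            - containerPort: 9300
              name: transport
```

Note that the values are quoted strings, and that the pods need to be recreated for the changed annotations to take effect. HTTP traffic on 9200 still goes through the sidecar with this setup.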
