ES loses quorum every hour

We have had our clusters running for a few years now on 7.17 with Java 8. Due to company requirements, I upgraded the cluster to use Java 17 (we use our own Java binary because of company policies). Once the cluster was on Java 17, we noticed the master losing a random data node every hour, causing unassigned shards; the cluster turned yellow and then recovered. Because of this flapping (and other reasons), we reverted from Java 17 to Java 8. One would imagine the issue would go away, but it is now worse: the master loses quorum along with the data node disconnects every hour!

We have 3 nodes of each type (3x master, 3x ingest, 3x data) running on Kubernetes. Initially I thought it might have something to do with istio-proxy timing out, but I did not find any relevant logs indicating that. I enabled debug logging on the cluster, and I see the master saying a data node left, so it marks that node's shards as unassigned; this is followed by the other 2 master nodes leaving, causing quorum failure and turning the cluster red. I can't for the life of me find a log entry that says why the master thinks the nodes left. There are no pod/container restarts, no memory/CPU issues, and no OOM errors.

I see these messages on my nodes, but they look harmless:

"stacktrace": ["java.nio.file.NoSuchFileException: /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb62b9ae8_5164_44b2_a6a2_6b7c563bc5f3.slice/cri-containerd-25bbd115796cded679a405e64a8566f9d4ad10029459d3f73d1d0e7ec19a69a5.scope/cpuacct.usage",

All this started after the upgrade to Java 17, and I am at a loss on how to resolve it. Any pointers would be appreciated! Thank you!

7.17 is really old and no longer supported or maintained, so you need to upgrade as a matter of urgency. The maintained versions all have much better troubleshooting support for this kind of thing.

Absent more information about your case, but based on other encounters with Istio, this would be my guess too. Elasticsearch requires TCP connections between nodes to remain open essentially forever, but Istio has other opinions and will actively interfere with these connections, which causes master failovers and all sorts of other problems.

From here:

idleTimeout: The idle timeout for TCP connections. The idle timeout is defined as the period in which there are no bytes sent or received on either the upstream or downstream connection. If not set, the default idle timeout is 1 hour

A bit suggestive :wink:
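
If the Istio idle timeout is indeed the culprit, one common workaround is to raise or disable it for the Elasticsearch transport traffic via a DestinationRule. A sketch only (the host, namespace, and metadata name below are placeholders; verify the field and the semantics of `0s` against your Istio version):

```yaml
# Hypothetical example: relax Istio's 1h default idle timeout for
# Elasticsearch inter-node (transport) connections.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: elasticsearch-transport
spec:
  host: elasticsearch-transport.default.svc.cluster.local  # placeholder service
  trafficPolicy:
    connectionPool:
      tcp:
        idleTimeout: 0s  # 0s is commonly used to disable the idle timeout; confirm for your mesh version
```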

Very strange that you linked this to Java 8 vs Java 17; IMHO it is unlikely that they are connected. Unless, of course, the introduction or reconfiguration or ... of the Istio proxy was close in time to the first use of Java 17.


Yeah that’s the badger.

In particular, this means that it ignores TCP keepalives.
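
For context, the keepalives in question are ordinary socket-level TCP keepalives, which Elasticsearch enables on its transport connections. A minimal Python sketch of the equivalent settings (the tuning constants are illustrative, and the `TCP_KEEP*` options are Linux-specific):

```python
import socket

# Enable OS-level TCP keepalives on a socket, roughly what
# Elasticsearch does on its long-lived transport connections.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Linux-specific tuning (illustrative values): start probing after
# 300s of idle, probe every 60s, give up after 4 failed probes.
if hasattr(socket, "TCP_KEEPIDLE"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 60)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 4)
```

The catch with a sidecar proxy is that these keepalives only traverse the app-to-proxy hop, so the proxy's own idle timer on the upstream connection can still fire even though the application thinks the connection is being kept alive.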

Makes you wonder what sequence of events led to this being seen as a good idea.
