Our clusters have been running on 7.17 with Java 8 for a few years now. Due to company requirements, I upgraded the cluster to Java 17 (we use our own Java binary due to company policies). Once the cluster was on Java 17, we noticed the master losing a random data node every hour, causing unassigned shards; the cluster would turn yellow and then recover. Because of this flapping (and other reasons), we reverted from Java 17 back to Java 8. One would imagine the issue would go away, but it is worse now - the master loses quorum along with the data node disconnects every hour!
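In case it helps, the JVM each node is actually running can be double-checked with the nodes info API (GET _nodes/jvm) - a minimal sketch, assuming an unauthenticated HTTP endpoint on localhost:9200 (host, port, and auth are placeholders for your setup):

```python
# Minimal sketch: print the JVM each node reports via GET _nodes/jvm.
# Assumes an unauthenticated HTTP endpoint on localhost:9200 -- adjust
# host, port, TLS, and auth for your environment.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:9200/_nodes/jvm") as resp:
    nodes = json.load(resp)["nodes"]

for node_id, info in nodes.items():
    jvm = info["jvm"]
    print(f'{info["name"]}: {jvm["vm_name"]} {jvm["version"]}')
```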
We have 3 nodes of each type - 3*master, 3*ingest, 3*data - running on Kubernetes. Initially I thought it might have something to do with istio-proxy timing out, but I did not find any relevant logs indicating that. I enabled debug logging on the cluster, and I see the elected master saying the data node left, so it marks the shards as unassigned; that is followed by the other two master nodes leaving, causing quorum failure and turning the cluster red. I can't for the life of me find a log entry that says why the master thinks the nodes left. There are no pod/container restarts, no memory/CPU issues, and no OOM errors.
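For reference, debug on the coordination and discovery loggers can be flipped dynamically with the cluster settings API - a minimal sketch, assuming an unauthenticated HTTP endpoint on localhost:9200 (host, port, and auth are placeholders for your setup):

```python
# Minimal sketch: turn on DEBUG for the coordination/discovery loggers via
# the cluster settings API. Assumes an unauthenticated HTTP endpoint on
# localhost:9200 -- adjust host, port, TLS, and auth for your environment.
import json
import urllib.request

settings = {
    "persistent": {
        "logger.org.elasticsearch.cluster.coordination": "DEBUG",
        "logger.org.elasticsearch.discovery": "DEBUG",
    }
}

req = urllib.request.Request(
    "http://localhost:9200/_cluster/settings",
    data=json.dumps(settings).encode("utf-8"),
    method="PUT",
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode())
```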
I see these messages on my nodes, but they look harmless:
"stacktrace": ["java.nio.file.NoSuchFileException: /sys/fs/cgroup/cpuacct/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podb62b9ae8_5164_44b2_a6a2_6b7c563bc5f3.slice/cri-containerd-25bbd115796cded679a405e64a8566f9d4ad10029459d3f73d1d0e7ec19a69a5.scope/cpuacct.usage",
All of this started after the upgrade to Java 17, and I am at a loss as to how to resolve it. Any pointers would be appreciated! Thank you!