This is really strange... I have several ES clusters that were originally at 6.3, then 6.8. There are various sources of logs flowing in (Beats, Logstash, etc.), all at various versions between 6 and 7. I am using Docker, but just using AWS with userdata and mapped volumes to run the nodes on i3.xlarges, along with m5.larges as dedicated masters.
Over time, I eventually did a rolling upgrade on all ES clusters from 6.8 to 7.6. They all work fine, even with pre-7 Beats still shipping logs to them. However, on one particular cluster upgrade, I messed up and completely blew up the cluster... So I started over and rebuilt it from scratch with the same topology, building it straight on 7.6.0. What I am seeing on this brand-new cluster is that, after a few days, there are a ton of 30-second timeouts all over. CPU gets elevated, data stops being indexed, and the timeouts all point to the node currently elected as master. Cluster status will be green... but sometimes it's yellow because a newly rolled-over index (I still roll them by date) fails to initialize properly: it stays stuck at yellow, unable to create a replica, with the document count stuck at zero.
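For what it's worth, when an index gets stuck yellow like that, the allocation explain API usually says why the replica won't assign. A sketch of what I run, assuming the cluster is reachable on localhost:9200 (adjust the host for your setup):

```shell
# Ask the cluster why the first unassigned shard isn't allocating.
# With no body, _cluster/allocation/explain picks an unassigned shard for you.
curl -s -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'
```

The `explanation` fields in the response spell out which deciders are blocking allocation.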
If I kill the master node, an immediate election takes place and that solves the issue... for several days. Then it happens again.
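To confirm which node the timeouts keep pointing at, I check the elected master before and after the restart. A minimal sketch, again assuming localhost:9200:

```shell
# Show the node id, host, and name of the currently elected master
curl -s 'http://localhost:9200/_cat/master?v'
```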
I'm stumped as to what it can be. I have looked at cluster health, and it looks fine. The only thing that kind of catches my eye is that, when the cluster was unresponsive, the hot_threads API showed the master node spending quite a bit of CPU on
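For reference, this is roughly how I pull hot threads from just the elected master while the cluster is misbehaving (a sketch; host and thread count are assumptions):

```shell
# Sample the top 10 hottest threads on the elected master node only.
# The _master node filter targets whichever node currently holds the role.
curl -s 'http://localhost:9200/_nodes/_master/hot_threads?threads=10'
```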