New Elasticsearch 7.6.0 cluster eventually becomes unresponsive

This is really strange. I have several ES clusters that were originally at 6.3, then 6.8. There are various sources of logs flowing in (Beats, Logstash, etc.), and they are all at various versions between 6 and 7. I am using Docker, but just on AWS with userdata and mapped volumes, running the data nodes on i3.xlarges along with m5.larges as dedicated masters.
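
For context, this is roughly how the dedicated masters are launched. This is only a sketch of my setup, assuming the official image; the node names, cluster name, IPs, heap size, and data path below are placeholders, and host-level prep (vm.max_map_count, ulimits) happens in the instance userdata.

```bash
# Minimal sketch of one dedicated-master container (placeholder names/IPs).
docker run -d --name es-master-1 \
  -p 9200:9200 -p 9300:9300 \
  -v /mnt/es-data:/usr/share/elasticsearch/data \
  -e "cluster.name=logs" \
  -e "node.name=es-master-1" \
  -e "node.master=true" \
  -e "node.data=false" \
  -e "node.ingest=false" \
  -e "discovery.seed_hosts=10.0.0.11,10.0.0.12,10.0.0.13" \
  -e "cluster.initial_master_nodes=es-master-1,es-master-2,es-master-3" \
  -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" \
  docker.elastic.co/elasticsearch/elasticsearch:7.6.0
```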

Over time, I did a rolling upgrade on all ES clusters from 6.8 to 7.6. They all work fine, even with pre-7 Beats still shipping logs to them. However, on one particular cluster upgrade I messed up and completely blew up the cluster, so I started over and rebuilt it from scratch with the same topology, building it straight on 7.6.0. What I am seeing on this brand new cluster is that, after a few days, there are a ton of 30-second timeouts all over, CPU gets elevated, and data stops being indexed. The timeouts all point to the node that is elected as master. The cluster status will usually be green, but sometimes it goes yellow because a newly rolled-over index (I still roll them by date) fails to initialize properly: it stays stuck at yellow, unable to create a replica, with the document count stuck at zero.
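
When an index is stuck like that, the allocation explain API will usually say why the replica is not being assigned. A quick sketch of what I run, with the index name below being a placeholder for whatever daily index is stuck at yellow:

```bash
# Ask the cluster why a specific replica shard is still unassigned.
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"index": "logstash-2020.02.20", "shard": 0, "primary": false}'

# Quick view of the stuck index and the state of each of its shards.
curl -s 'http://localhost:9200/_cat/shards/logstash-2020.02.20?v'
```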

If I kill the master node, an immediate election takes place and that solves the issue, but only for several days, and then it happens again.
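
To confirm which node currently holds the master role before and after restarting it, I check the cat APIs (just an example of what I look at, nothing cluster-specific):

```bash
# Show the currently elected master node.
curl -s 'http://localhost:9200/_cat/master?v'

# Node overview; the "master" column marks the elected master with "*".
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master,heap.percent,cpu'
```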

I'm stumped as to what it can be. I have looked at cluster health and it looks fine. The only thing that kind of catches my eye is that, when the cluster was unresponsive, the hot_threads API showed the master node spending quite a bit of CPU on

```
[masterService#updateTask][T#1]
```
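
For reference, this is roughly how I pull that, plus the pending cluster tasks API, which shows the queue of cluster-state updates that masterService#updateTask works through (the thread count is just an example value):

```bash
# Hot threads for master-eligible nodes only; the master:true node filter
# narrows the output to the elected master and its standbys.
curl -s 'http://localhost:9200/_nodes/master:true/hot_threads?threads=5'

# Pending cluster-state updates; a long, growing queue here tends to line up
# with the elected master burning CPU in masterService#updateTask.
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'
```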

Can you share a full hot_threads output somewhere in a gist?

Unfortunately, given the urgency of the issue, I had no choice but to blow the cluster away and start over. However, unlike the last time, I tried to reproduce my original upgrade path by:

a. First installing a 6.8 cluster.
b. Once the input sources started shipping logs into the ES cluster, doing a rolling upgrade to ES 7.6 (roughly the standard rolling-upgrade steps sketched below).
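
For completeness, this is the shape of one pass of the rolling upgrade I follow per node; the exact stop/start of the container depends on how it is launched from userdata, so step 3 is only described in a comment:

```bash
# 1. Stop replica allocation so shards are not shuffled while a node is down.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "primaries"}}'

# 2. Optionally run a synced flush so shard recovery is faster after the restart.
curl -s -X POST 'http://localhost:9200/_flush/synced'

# 3. Stop the node, swap the 6.8 image for 7.6, and start it again
#    (details depend on how the container is launched from userdata).

# 4. Re-enable allocation and wait for green before moving to the next node.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": null}}'
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60s'
```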

It seems the issue no longer occurs if I go that route. It would appear that if pre-v7 Beats start shipping to a v7 Elasticsearch cluster that was built from scratch, they slowly kill the cluster, but that doesn't happen if the cluster was upgraded from 6.8. It's really strange. In any case, it will make me push to get all the Beats versions to v7.
