New Elasticsearch 7.6.0 cluster eventually becomes unresponsive

This is really strange. I have several ES clusters that were originally at 6.3, then 6.8. There are various sources of logs flowing in (Beats, Logstash, etc.), and they are all at various versions between 6 and 7. I am using Docker, but just on AWS with userdata and mapped volumes, running the data nodes on i3.xlarges along with m5.larges as dedicated masters.
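
For context, this is roughly how the dedicated masters are launched. This is only a sketch of my setup, assuming the official image; the node names, cluster name, IPs, heap size, and data path below are placeholders, and host-level prep (vm.max_map_count, ulimits) happens in the instance userdata.

```bash
# Minimal sketch of one dedicated-master container (placeholder names/IPs).
docker run -d --name es-master-1 \
  -p 9200:9200 -p 9300:9300 \
  -v /mnt/es-data:/usr/share/elasticsearch/data \
  -e "cluster.name=logs" \
  -e "node.name=es-master-1" \
  -e "node.master=true" \
  -e "node.data=false" \
  -e "node.ingest=false" \
  -e "discovery.seed_hosts=10.0.0.11,10.0.0.12,10.0.0.13" \
  -e "cluster.initial_master_nodes=es-master-1,es-master-2,es-master-3" \
  -e "ES_JAVA_OPTS=-Xms4g -Xmx4g" \
  docker.elastic.co/elasticsearch/elasticsearch:7.6.0
```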

Over time, I did a rolling upgrade on all ES clusters from 6.8 to 7.6. They all work fine, even with pre-7 Beats still shipping logs to them. However, on one particular cluster upgrade I messed up and completely blew up the cluster, so I started over and rebuilt it from scratch with the same topology, building it straight on 7.6.0. What I am seeing on this brand new cluster is that, after a few days, there are a ton of 30-second timeouts all over, CPU gets elevated, and data stops being indexed. The timeouts all point to the node that is elected as master. The cluster status will usually be green, but sometimes it goes yellow because a newly rolled-over index (I still roll them by date) fails to initialize properly: it stays stuck at yellow, unable to create a replica, with the document count stuck at zero.
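
When an index is stuck like that, the allocation explain API will usually say why the replica is not being assigned. A quick sketch of what I run, with the index name below being a placeholder for whatever daily index is stuck at yellow:

```bash
# Ask the cluster why a specific replica shard is still unassigned.
curl -s 'http://localhost:9200/_cluster/allocation/explain?pretty' \
  -H 'Content-Type: application/json' \
  -d '{"index": "logstash-2020.02.20", "shard": 0, "primary": false}'

# Quick view of the stuck index and the state of each of its shards.
curl -s 'http://localhost:9200/_cat/shards/logstash-2020.02.20?v'
```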

If I kill the master node, an immediate election takes place and that solves the issue, but only for several days, and then it happens again.
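
To confirm which node currently holds the master role before and after restarting it, I check the cat APIs (just an example of what I look at, nothing cluster-specific):

```bash
# Show the currently elected master node.
curl -s 'http://localhost:9200/_cat/master?v'

# Node overview; the "master" column marks the elected master with "*".
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master,heap.percent,cpu'
```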

I'm stumped as to what it can be. I have looked at cluster health and it looks fine. The only thing that kind of catches my eye is that, when the cluster was unresponsive, the hot_threads API showed the master node spending quite a bit of CPU on

```
[masterService#updateTask][T#1]
```
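
For reference, this is roughly how I pull that, plus the pending cluster tasks API, which shows the queue of cluster-state updates that masterService#updateTask works through (the thread count is just an example value):

```bash
# Hot threads for master-eligible nodes only; the master:true node filter
# narrows the output to the elected master and its standbys.
curl -s 'http://localhost:9200/_nodes/master:true/hot_threads?threads=5'

# Pending cluster-state updates; a long, growing queue here tends to line up
# with the elected master burning CPU in masterService#updateTask.
curl -s 'http://localhost:9200/_cluster/pending_tasks?pretty'
```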

Can you share a full hot_threads output somewhere in a gist?

Unfortunately, given the urgency of the issue, I had no choice but to blow the cluster away and start over. However, unlike the last time, I tried to reproduce my original upgrade path by:

a. First installing a 6.8 cluster.
b. Once the input sources started shipping logs into the ES cluster, doing a rolling upgrade to ES 7.6 (roughly the standard rolling-upgrade steps sketched below).
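
For completeness, this is the shape of one pass of the rolling upgrade I follow per node; the exact stop/start of the container depends on how it is launched from userdata, so step 3 is only described in a comment:

```bash
# 1. Stop replica allocation so shards are not shuffled while a node is down.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": "primaries"}}'

# 2. Optionally run a synced flush so shard recovery is faster after the restart.
curl -s -X POST 'http://localhost:9200/_flush/synced'

# 3. Stop the node, swap the 6.8 image for 7.6, and start it again
#    (details depend on how the container is launched from userdata).

# 4. Re-enable allocation and wait for green before moving to the next node.
curl -s -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"persistent": {"cluster.routing.allocation.enable": null}}'
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=60s'
```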

It seems the issue no longer occurs if I go that route. It would appear that if pre-v7 Beats start shipping to a v7 Elasticsearch cluster that was built from scratch, they slowly kill the cluster, but that doesn't happen if the cluster was upgraded from 6.8. It's really strange. In any case, it will make me push to get all the Beats versions to v7.
