Nodes constantly losing connection

I have a 2-node cluster (archive and main), and today the nodes started losing connection.
These are the two log messages that keep appearing:
[2021-01-27T10:26:55,816][WARN ][o.e.g.PersistedClusterStateService] [XXXarchive.XXX.local] writing cluster state took [69043ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [124] indices and skipped [127] unchanged indices
[2021-01-27T10:28:11,586][WARN ][o.e.c.c.C.CoordinatorPublication] [XXX-main.XXX.local] after [30s] publication of cluster state version [95968] is still waiting for {XXXarchive.XXX.local}{cPA9OW7KQhKGQ-_xtXGnhg}{y-JIm7z-QWO8NhP2fdwTBQ}{XXX-archive.XXX.local}{X.X.X.103:9300}{dilmrt}{ml.machine_memory=8201244672, ml.max_open_jobs=20, xpack.installed=true, box_type=warm, transform.node=true} [SENT_PUBLISH_REQUEST]
I have around 140 indices and 160 shards.
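
To get a feel for how far over the threshold these writes are, the first warning can be parsed with a small script. This is only a rough sketch: the function name `slow_write_ratio` is made up for illustration, and the pattern assumes the message looks exactly like the one above, with the duration in milliseconds and the threshold in whole seconds.

```python
import re

# Matches the "writing cluster state took [...]" warning quoted above.
# Assumption: duration is reported in ms and the threshold in whole seconds.
SLOW_WRITE = re.compile(
    r"writing cluster state took \[(?P<took>\d+)ms\] "
    r"which is above the warn threshold of \[(?P<threshold>\d+)s\]"
)


def slow_write_ratio(line):
    """Return (took_seconds, threshold_seconds, ratio), or None if no match."""
    match = SLOW_WRITE.search(line)
    if match is None:
        return None
    took_s = int(match.group("took")) / 1000.0
    threshold_s = int(match.group("threshold"))
    return took_s, threshold_s, took_s / threshold_s


if __name__ == "__main__":
    sample = (
        "[2021-01-27T10:26:55,816][WARN ][o.e.g.PersistedClusterStateService] "
        "[XXXarchive.XXX.local] writing cluster state took [69043ms] which is "
        "above the warn threshold of [10s]; wrote global metadata [false]"
    )
    print(slow_write_ratio(sample))
```

For the line above, the write took roughly 69 seconds against a 10 second threshold, so it is almost seven times slower than the point at which Elasticsearch starts warning.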

It looks like your cluster is overloaded and/or has far too slow storage. What is the specification of your cluster? What type of storage are you using?


I'm using a SAN HDD on the main node and an NFS disk on the second (archive) node. I know they're pretty slow, but I've never had this problem in the past 3 months.

Here is the output of iostat on both nodes:
main:


archive:

Could the high iowait percentage be a problem?

The fact that you have not had this problem in the past 3 months does not mean that it is not a problem now.

Yes. It looks like this node is struggling, at least periodically.
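
If you want a rough number for how slow the storage is, you could time small synced writes in the directory that holds the Elasticsearch data. The script below is a minimal sketch and not an official tool: the function name `fsync_latencies` and its parameters are invented for this example, and it assumes you are allowed to create a temporary file in the directory you point it at. It is relevant because persisting the cluster state involves syncing data to disk, so slow fsync calls would be consistent with the first warning.

```python
import os
import sys
import tempfile
import time


def fsync_latencies(directory, writes=20, size=4096):
    """Write `writes` blocks of `size` bytes to a temp file in `directory`,
    calling fsync after each one, and return the per-write latencies in seconds."""
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        payload = b"\0" * size
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage before timing stops
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)  # clean up the temporary file
    return latencies


if __name__ == "__main__":
    # Pass the data directory as the first argument; defaults to the system temp dir.
    target = sys.argv[1] if len(sys.argv) > 1 else tempfile.gettempdir()
    results = fsync_latencies(target)
    print(f"max {max(results) * 1000:.1f} ms, "
          f"avg {sum(results) / len(results) * 1000:.1f} ms")
```

Run it on both nodes against the data path and compare. Latencies that regularly reach hundreds of milliseconds or more would fit the picture of storage that cannot keep up.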
