Data nodes randomly leaving cluster

Hello,

I have a problem that I can't seem to fix, or even understand.

I have an Elasticsearch cluster with three master nodes and six physical data machines running 12 data node instances.
Master nodes: virtual machines with 2 CPUs and 4 GB of memory.
Data nodes: physical machines with 28 CPUs and 128 GB of RAM.
Each data machine runs two Docker containers: one hot instance (SSD storage) and one warm instance (HDD storage).

Elasticsearch version is 7.16.2 in all instances.

Problem
A few days ago, because of a conflict between my ILM policy and the initial index requirements, data wasn't moved from hot to warm the way it was supposed to be, and the hot instances ran out of space. So I started manually moving indexes from hot to warm to free up space and relocate everything so the cluster would leave RED status.
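
For context, a manual hot-to-warm move like this is normally done by updating the index-level allocation filter to match the node.attr.data attribute from the configs below; a sketch, using one of my index names as an example:

PUT dcache_kafka-2021.11.10/_settings
{
  "index.routing.allocation.require.data": "warm"
}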

Here's where my problems started. Nodes began to leave the cluster at random (at least it looked random to me), with all their shards trying to relocate, causing a cascade of failures. I thought it could be caused by hundreds of shards trying to move around the cluster at once, which is not what would normally happen with ILM-driven data movement.

I stopped shard relocation and then started relocating indexes manually, little by little, but nodes would still randomly leave the cluster while some indexes were relocating.
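
To pause automatic shard movement, the usual approach is a dynamic cluster setting along these lines (a sketch, not necessarily the exact call I used; it can be reset to null afterwards):

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.rebalance.enable": "none"
  }
}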

One thing to clarify: the cluster works "properly" as long as nodes are not leaving it. I can query through Kibana, and logs are being indexed.

Settings for master nodes

cluster.name: es-pic
network.host: 0.0.0.0
network.publish_host: 192.168.96.170
network.bind_host: 0.0.0.0
node.name: es-master03
node.roles: [ master ]
cluster.initial_master_nodes: es-master01, es-master02, es-master03
discovery.seed_hosts: es-master01.pic.es:9300,es-master02.pic.es:9300,es-master03.pic.es:9300
path.data: /usr/share/elasticsearch/data
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/certs/es_pic_es.key
xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/certs/es_pic_es.crt
xpack.security.transport.ssl.certificate_authorities: /usr/share/elasticsearch/config/certs/SectigoCA.crt
discovery.seed_resolver.timeout: 30s
discovery.seed_resolver.max_concurrent_resolvers: 20
monitor.fs.health.slow_path_logging_threshold: 30s

Settings for hot data nodes

cluster.name: es-pic
network.host: 0.0.0.0
network.publish_host: 192.168.96.127
network.bind_host: 0.0.0.0
node.name: es-data01_hot
node.roles: [ data ]
node.attr.data: hot
http.port: 9200
transport.port: 9300
cluster.initial_master_nodes: es-master01, es-master02, es-master03
discovery.seed_hosts: es-master01.pic.es:9300,es-master02.pic.es:9300,es-master03.pic.es:9300
path.data: /usr/share/elasticsearch/data
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/certs/es_pic_es.key
xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/certs/es_pic_es.crt
xpack.security.transport.ssl.certificate_authorities: /usr/share/elasticsearch/config/certs/SectigoCA.crt
discovery.seed_resolver.timeout: 30s
discovery.seed_resolver.max_concurrent_resolvers: 20
monitor.fs.health.slow_path_logging_threshold: 30s

Settings for warm data nodes

cluster.name: es-pic
network.host: 0.0.0.0
network.publish_host: 192.168.96.127
network.bind_host: 0.0.0.0
node.name: es-data01_warm
node.roles: [ data ]
node.attr.data: warm
http.port: 9201
transport.port: 9301
cluster.initial_master_nodes: es-master01, es-master02, es-master03
discovery.seed_hosts: es-master01.pic.es:9300,es-master02.pic.es:9300,es-master03.pic.es:9300
path.data: /usr/share/elasticsearch/data1,/usr/share/elasticsearch/data2
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/certs/es_pic_es.key
xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/certs/es_pic_es.crt
xpack.security.transport.ssl.certificate_authorities: /usr/share/elasticsearch/config/certs/SectigoCA.crt
discovery.seed_resolver.timeout: 30s
discovery.seed_resolver.max_concurrent_resolvers: 20
monitor.fs.health.slow_path_logging_threshold: 30s

Errors
On a hot data node (this appeared several times at the same time)

Jan 19 15:27:10 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:27:09,877Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "sending transport message [Request{indices:admin/seq_no/global_checkpoint_sync[r]}{171888256}{false}{false}{false}] of size [368] on [Netty4TcpChannel{localAddress=/172.17.0.2:38576, remoteAddress=192.168.96.129/192.168.96.129:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms] with success [true]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }

A few seconds later, from another node

Jan 19 15:29:52 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:29:52,965Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": "removed {{es-data05_hot}{Q0IH51fZSM2SVGzKUjP3Mw}{dRPTMsj5Q-mAYGwk0S0FOQ}{192.168.96.131}{192.168.96.131:9300}{d}}, term: 6, version: 196505, reason: ApplyCommitRequest{term=6, version=196505, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag"  }
Jan 19 15:29:52 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:29:52,981Z", "level": "INFO", "component": "o.e.i.s.IndexShard", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": " [dcache_kafka-2021.11.10][0] primary-replica resync completed with 0 operations", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag"  }
Jan 19 15:29:52 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:29:52,982Z", "level": "INFO", "component": "o.e.i.s.IndexShard", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": " [elastiflow-flow-codex-1.2-2021.12.31][0] primary-replica resync completed with 0 operations", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag"  }
Jan 19 15:29:52 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:29:52,985Z", "level": "INFO", "component": "o.e.i.s.IndexShard", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": " [.ds-.logs-deprecation.elasticsearch-default-2022.01.19-000003][0] primary-replica resync completed with 0 operations", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag"  }
Jan 19 15:30:07 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:30:07,058Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": "sending transport message [Request{indices:data/write/bulk[s][r]}{100478886}{false}{false}{false}] of size [2904805] on [Netty4TcpChannel{localAddress=/172.17.0.2:33597, remoteAddress=192.168.96.128/192.168.96.128:9300, profile=default}] took [11006ms] which is above the warn threshold of [5000ms] with success [true]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag"  }
Jan 19 15:30:13 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:30:13,163Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": "added {{es-data05_hot}{Q0IH51fZSM2SVGzKUjP3Mw}{dRPTMsj5Q-mAYGwk0S0FOQ}{192.168.96.131}{192.168.96.131:9300}{d}}, term: 6, version: 196511, reason: ApplyCommitRequest{term=6, version=196511, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag"  }
Jan 19 15:30:16 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:30:16,958Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": "sending transport message [Request{indices:data/write/bulk[s][r]}{100478672}{false}{false}{false}] of size [2835870] on [Netty4TcpChannel{localAddress=/172.17.0.2:33598, remoteAddress=192.168.96.128/192.168.96.128:9300, profile=default}] took [51428ms] which is above the warn threshold of [5000ms] with success [true]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag"  }

And from the node that was removed

Jan 19 15:29:52 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:29:52,978Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "[3] consecutive checks of the master node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}] were unsuccessful ([0] rejected, [3] timed out), restarting discovery; more details may be available in the master node logs [last unsuccessful check: [es-master03][192.168.96.170:9300][internal:coordination/fault_detection/leader_check] request_id [171913802] timed out after [10005ms]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:29:52 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:29:52,980Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "master node changed {previous [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}], current []}, term: 6, version: 196504, reason: becoming candidate: onLeaderFailure", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:29:56 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:29:56,016Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.169:9300], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:29:56 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master02][192.168.96.169:9300][internal:discovery/request_peers] request_id [171914018] timed out after [3002ms]"] }
Jan 19 15:29:57 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:29:57,980Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.170:9300], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:29:57 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master03][192.168.96.170:9300][internal:discovery/request_peers] request_id [171914167] timed out after [3001ms]"] }
Jan 19 15:30:00 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:00,981Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.169:9300], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:00 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master02][192.168.96.169:9300][internal:discovery/request_peers] request_id [171914191] timed out after [3002ms]"] }
Jan 19 15:30:02 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:02,980Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "master not discovered yet: have discovered [{es-data05_hot}{Q0IH51fZSM2SVGzKUjP3Mw}{dRPTMsj5Q-mAYGwk0S0FOQ}{192.168.96.131}{192.168.96.131:9300}{d}, {es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}, {es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}, {es-master01}{wLE3jl04QqC5IJIvVpkjjg}{f9MoIlfHRZWG8GKidub9Lg}{192.168.96.168}{192.168.96.168:9300}{m}]; discovery will continue using [192.168.96.168:9300, 192.168.96.169:9300, 192.168.96.170:9300] from hosts providers and [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}, {es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}, {es-master01}{wLE3jl04QqC5IJIvVpkjjg}{f9MoIlfHRZWG8GKidub9Lg}{192.168.96.168}{192.168.96.168:9300}{m}] from last-known cluster state; node term 6, last-accepted version 196504 in term 6", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:04 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:04,982Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.170:9300], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:04 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master03][192.168.96.170:9300][internal:discovery/request_peers] request_id [171914247] timed out after [3002ms]"] }
Jan 19 15:30:05 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:05,982Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.169:9300], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:05 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master02][192.168.96.169:9300][internal:discovery/request_peers] request_id [171914262] timed out after [3002ms]"] }
Jan 19 15:30:07 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:07,733Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [12.8s/12806ms] ago, timed out [9.8s/9805ms] ago, action [internal:discovery/request_peers], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], id [171914167]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:07 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:07,733Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [5.8s/5803ms] ago, timed out [2.8s/2801ms] ago, action [internal:discovery/request_peers], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], id [171914247]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:08 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:08,983Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.169:9300], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:08 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master02][192.168.96.169:9300][internal:discovery/request_peers] request_id [171914310] timed out after [3001ms]"] }
Jan 19 15:30:11 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:11,984Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.169:9300], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:11 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master02][192.168.96.169:9300][internal:discovery/request_peers] request_id [171914369] timed out after [3002ms]"] }
Jan 19 15:30:12 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:12,589Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [51.6s/51628ms] ago, timed out [41.6s/41623ms] ago, action [internal:coordination/fault_detection/leader_check], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], id [171912397]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:12 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:12,593Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [40.6s/40622ms] ago, timed out [30.6s/30617ms] ago, action [internal:coordination/fault_detection/leader_check], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], id [171913456]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:12 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:12,594Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [29.6s/29616ms] ago, timed out [19.6s/19611ms] ago, action [internal:coordination/fault_detection/leader_check], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], id [171913802]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:12 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:12,981Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "master not discovered yet: have discovered [{es-data05_hot}{Q0IH51fZSM2SVGzKUjP3Mw}{dRPTMsj5Q-mAYGwk0S0FOQ}{192.168.96.131}{192.168.96.131:9300}{d}, {es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}, {es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}, {es-master01}{wLE3jl04QqC5IJIvVpkjjg}{f9MoIlfHRZWG8GKidub9Lg}{192.168.96.168}{192.168.96.168:9300}{m}]; discovery will continue using [192.168.96.168:9300, 192.168.96.169:9300, 192.168.96.170:9300] from hosts providers and [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}, {es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}, {es-master01}{wLE3jl04QqC5IJIvVpkjjg}{f9MoIlfHRZWG8GKidub9Lg}{192.168.96.168}{192.168.96.168:9300}{m}] from last-known cluster state; node term 6, last-accepted version 196504 in term 6", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:13 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:13,310Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "master node changed {previous [], current [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}]}, term: 6, version: 196511, reason: ApplyCommitRequest{term=6, version=196511, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:16,943Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "sending transport message [Request{indices:data/write/bulk[s][r]}{171912798}{false}{false}{false}] of size [3008543] on [Netty4TcpChannel{localAddress=/172.17.0.2:42548, remoteAddress=192.168.96.132/192.168.96.132:9300, profile=default}] took [51427ms] which is above the warn threshold of [5000ms] with success [true]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:16,951Z", "level": "WARN", "component": "o.e.a.b.TransportShardBulkAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "[[elastiflow-flow-codex-1.2-2022.01.19][1]] failed to perform indices:data/write/bulk[s] on replica [elastiflow-flow-codex-1.2-2022.01.19][1], node[gYHPagfBSHGOdviFJgg71Q], [R], s[STARTED], a[id=xe2rFelfTO2WH7ZMjEMedQ]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-data06_hot][172.17.0.2:9300][indices:data/write/bulk[s][r]]",
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: "Caused by: java.lang.IllegalStateException: [elastiflow-flow-codex-1.2-2022.01.19][1] operation primary term [5] is too old (current [6])",
############################
REST OF LONG JAVA STACKTRACE
############################
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: "at java.lang.Thread.run(Thread.java:833) [?:?]"] }
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:16,957Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "no master known for action [internal:cluster/shard/failure] for shard entry [shard id [[elastiflow-flow-codex-1.2-2022.01.19][1]], allocation id [xe2rFelfTO2WH7ZMjEMedQ], primary term [5], message [failed to perform indices:data/write/bulk[s] on replica [elastiflow-flow-codex-1.2-2022.01.19][1], node[gYHPagfBSHGOdviFJgg71Q], [R], s[STARTED], a[id=xe2rFelfTO2WH7ZMjEMedQ]], failure [RemoteTransportException[[es-data06_hot][172.17.0.2:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[elastiflow-flow-codex-1.2-2022.01.19][1] operation primary term [5] is too old (current [6])]; ], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:17 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:17,364Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "no master known for action [internal:cluster/shard/failure] for shard entry [shard id [[dcache_kafka-2022.01.19][0]], allocation id [HD3AFtxXTlGH_BQpRNst6Q], primary term [9], message [failed to perform indices:data/write/bulk[s] on replica [dcache_kafka-2022.01.19][0], node[zdZCJSKSRDaBbVijIHkIAA], [R], s[STARTED], a[id=HD3AFtxXTlGH_BQpRNst6Q]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:18 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:18,122Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "no master known for action [internal:cluster/shard/failure] for shard entry [shard id [[syslog-2022.01.19][0]], allocation id [yC5NAAj_RBKqICGwsjseRw], primary term [9], message [failed to perform indices:data/write/bulk[s] on replica [syslog-2022.01.19][0], node[zdZCJSKSRDaBbVijIHkIAA], [R], s[STARTED], a[id=yC5NAAj_RBKqICGwsjseRw]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:18 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:18,695Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [12.6s/12606ms] ago, timed out [9.6s/9605ms] ago, action [internal:discovery/request_peers], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], id [171914310]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:30:18 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:18,998Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [26s/26015ms] ago, timed out [23s/23013ms] ago, action [internal:discovery/request_peers], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], id [171914018]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw"  }
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:31:09,952Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "unexpected failure while sending request [internal:cluster/shard/failure] to [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}] for shard entry [shard id [[dcache_kafka-2022.01.19][0]], allocation id [HD3AFtxXTlGH_BQpRNst6Q], primary term [9], message [failed to perform indices:data/write/bulk[s] on replica [dcache_kafka-2022.01.19][0], node[zdZCJSKSRDaBbVijIHkIAA], [R], s[STARTED], a[id=HD3AFtxXTlGH_BQpRNst6Q]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-master03][172.17.0.2:9300][internal:cluster/shard/failure]",
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [9] did not match current primary term [10]",
############################
REST OF LONG JAVA STACKTRACE
############################
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "at java.lang.Thread.run(Thread.java:833) [?:?]"] }
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:31:09,955Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "unexpected failure while sending request [internal:cluster/shard/failure] to [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}] for shard entry [shard id [[elastiflow-flow-codex-1.2-2022.01.19][1]], allocation id [xe2rFelfTO2WH7ZMjEMedQ], primary term [5], message [failed to perform indices:data/write/bulk[s] on replica [elastiflow-flow-codex-1.2-2022.01.19][1], node[gYHPagfBSHGOdviFJgg71Q], [R], s[STARTED], a[id=xe2rFelfTO2WH7ZMjEMedQ]], failure [RemoteTransportException[[es-data06_hot][172.17.0.2:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[elastiflow-flow-codex-1.2-2022.01.19][1] operation primary term [5] is too old (current [6])]; ], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-master03][172.17.0.2:9300][internal:cluster/shard/failure]",
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [5] did not match current primary term [6]",
############################
REST OF LONG JAVA STACKTRACE
############################
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "at java.lang.Thread.run(Thread.java:833) [?:?]"] }
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:31:09,964Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "unexpected failure while sending request [internal:cluster/shard/failure] to [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}] for shard entry [shard id [[syslog-2022.01.19][0]], allocation id [yC5NAAj_RBKqICGwsjseRw], primary term [9], message [failed to perform indices:data/write/bulk[s] on replica [syslog-2022.01.19][0], node[zdZCJSKSRDaBbVijIHkIAA], [R], s[STARTED], a[id=yC5NAAj_RBKqICGwsjseRw]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-master03][172.17.0.2:9300][internal:cluster/shard/failure]",
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [9] did not match current primary term [10]",
############################
REST OF LONG JAVA STACKTRACE
############################
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "at java.lang.Thread.run(Thread.java:833) [?:?]"] }

And on other nodes (master and data), a few seconds before some of the errors above, the removed node was added back to the cluster

Jan 19 15:30:13 es-data03.pic.es docker-es_warm[8304]: {"type": "server", "timestamp": "2022-01-19T14:30:13,163Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-data03_warm", "message": "added {{es-data05_hot}{Q0IH51fZSM2SVGzKUjP3Mw}{dRPTMsj5Q-mAYGwk0S0FOQ}{192.168.96.131}{192.168.96.131:9300}{d}}, term: 6, version: 196511, reason: ApplyCommitRequest{term=6, version=196511, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "78dqOfXWR0uZHOc1_st2dg"  }

After that, operations resumed and eventually the indexes were relocated, but at any moment a node could leave the cluster again and the whole thing would start over. I don't think this is how it is supposed to work; having the cluster constantly bouncing from yellow to red and (eventually) back to green, never knowing when it will go red or yellow again, doesn't seem optimal for me, or for the small team we are.

Honestly, this had been working without a failure until the first storage problems appeared. I'm not sure if there is something else involved that I'm not aware of, but this is all the information I have been able to gather.

As additional information, I modified the setting that controls how long the cluster waits before reallocating the shards of a node that has left, so the cluster doesn't go completely crazy. I know the node will be back in under 10 minutes, but I guess this is not good practice either.

{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}
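
(That JSON is the body of an index settings update; applied to every index it looks roughly like this:)

PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}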

So, I don't know if I have something misconfigured, whether anybody has had something similar happen to their cluster in the past, or if any of this rings a bell, but any help would be hugely appreciated.

Thanks,

Jordi

I would suggest that the first thing you should do is to increase the specs of your master nodes.

With 4 GB of RAM, how much are you using for the heap? 2 GB? This seems too small for your cluster.

Also, what are the specs of your data nodes? You said physical machines with 128 GB of RAM and mentioned containers, but what is the heap of each node? With 128 GB of RAM you could easily have two nodes with 30 GB of heap each. Is this what you have?

You should also share some logs from the master nodes when the nodes leave and rejoin the cluster.


Also look at every node's heap usage, number of garbage collections, and time spent in GC.
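
For example, something along these lines pulls heap usage and per-collector GC counts and times from all nodes (the filter_path parameter just trims the response):

GET _nodes/stats/jvm?filter_path=nodes.*.name,nodes.*.jvm.mem.heap_used_percent,nodes.*.jvm.gc.collectors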


Hi, thanks for your reply!

Yes, my master nodes are virtual machines with 2 GB dedicated to heap. I can increase the memory up to 8 GB and assign 4 GB for heap (as I think the recommendation is to give the heap no more than half of the machine's memory, and no more than about 30 GB).
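
(On the Docker containers that is just the JVM heap size, set e.g. in jvm.options.d/ or through the ES_JAVA_OPTS environment variable; a sketch with the 4 GB value I plan to use on the masters:)

-Xms4g
-Xmx4g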

The physical machines run two instances, each with 30 GB of heap.

I will try this change, see how the cluster reacts, and collect the master logs if it fails again.

Thank you!

Hello. Thanks for your answer :slight_smile:

I will look into it too, thanks. I was looking at the monitoring and couldn't see anything weird, but I am not an expert on GC at all and was probably not looking in the right place. I will put more effort into finding these numbers and understanding them.

Thanks!

Hi Leandro,

I added more memory to the nodes, but nothing changed; the error keeps happening in the same way.

I collected some log lines from a master node and also from the garbage collector log.

I hope you see something that rings a bell, because I have searched a lot for information about the messages that appear, but I haven't found anything definitive.

Master
https://pastebin.com/raw/RRJcGjyM

Garbage Collector
https://pastebin.com/raw/XnVYpGvz

Meanwhile, I will try to learn how the garbage collector works, because I lack knowledge there.

Thanks!

Hi Len,

I pasted some logs in my reply to Leandro, some of them from the garbage collector. As I told him, I will have to learn about garbage collection because I don't really understand what the logs are telling me.

Thanks!

This happened again just now, without me trying to relocate any shard. Just because.

Jan 24 17:29:23 es-master02.pic.es docker-es[2023]: {"type": "server", "timestamp": "2022-01-24T16:29:23,848Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-master02", "message": "removed {{es-data02_hot}{5-tUsM9GQZiWMO7fOZtH3Q}{xz2LgFV4RfeDKZkINZTpnQ}{192.168.96.128}{192.168.96.128:9300}{d}}, term: 11, version: 220640, reason: ApplyCommitRequest{term=11, version=220640, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{vDmjVfsLSjq7Tlf-fRdvzQ}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "YVg1_1FXS5qbVBB2VNztoA"  }
Jan 24 17:29:30 es-master02.pic.es docker-es[2023]: {"type": "server", "timestamp": "2022-01-24T16:29:30,598Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-master02", "message": "removed {{es-data02_warm}{kkPmDGCsQ6aSBYzjbOkkuA}{3Epnz_ihRmGueQiJYii26w}{192.168.96.128}{192.168.96.128:9301}{d}}, term: 11, version: 220641, reason: ApplyCommitRequest{term=11, version=220641, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{vDmjVfsLSjq7Tlf-fRdvzQ}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "YVg1_1FXS5qbVBB2VNztoA"  }
Jan 24 17:29:43 es-master02.pic.es docker-es[2023]: {"type": "server", "timestamp": "2022-01-24T16:29:43,736Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-master02", "message": "added {{es-data02_hot}{5-tUsM9GQZiWMO7fOZtH3Q}{xz2LgFV4RfeDKZkINZTpnQ}{192.168.96.128}{192.168.96.128:9300}{d}}, term: 11, version: 220644, reason: ApplyCommitRequest{term=11, version=220644, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{vDmjVfsLSjq7Tlf-fRdvzQ}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "YVg1_1FXS5qbVBB2VNztoA"  }
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: {"type": "server", "timestamp": "2022-01-24T16:29:54,749Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "es-pic", "node.name": "es-master02", "message": "send message failed [channel: Netty4TcpChannel{localAddress=/172.17.0.2:59408, remoteAddress=192.168.96.128/192.168.96.128:9300, profile=default}]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "YVg1_1FXS5qbVBB2VNztoA" ,
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "stacktrace": ["io.netty.handler.ssl.SslHandshakeTimeoutException: handshake timed out after 10000ms",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.handler.ssl.SslHandler$7.run(SslHandler.java:2114) [netty-handler-4.1.66.Final.jar:4.1.66.Final]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [netty-common-4.1.66.Final.jar:4.1.66.Final]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) [netty-common-4.1.66.Final.jar:4.1.66.Final]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [netty-common-4.1.66.Final.jar:4.1.66.Final]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) [netty-common-4.1.66.Final.jar:4.1.66.Final]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) [netty-transport-4.1.66.Final.jar:4.1.66.Final]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [netty-common-4.1.66.Final.jar:4.1.66.Final]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.66.Final.jar:4.1.66.Final]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at java.lang.Thread.run(Thread.java:833) [?:?]"] }
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: {"type": "server", "timestamp": "2022-01-24T16:29:54,752Z", "level": "WARN", "component": "o.e.c.NodeConnectionsService", "cluster.name": "es-pic", "node.name": "es-master02", "message": "failed to connect to {es-data02_hot}{5-tUsM9GQZiWMO7fOZtH3Q}{xz2LgFV4RfeDKZkINZTpnQ}{192.168.96.128}{192.168.96.128:9300}{d}{data=hot, xpack.installed=true, transform.node=false} (tried [1] times)", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "YVg1_1FXS5qbVBB2VNztoA" ,
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "stacktrace": ["org.elasticsearch.transport.ConnectTransportException: [es-data02_hot][192.168.96.128:9300] general node connection failure",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.lambda$onResponse$2(TcpTransport.java:1035) ~[elasticsearch-7.16.2.jar:7.16.2]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:144) ~[elasticsearch-7.16.2.jar:7.16.2]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at org.elasticsearch.transport.TransportHandshaker$HandshakeResponseHandler.handleLocalException(TransportHandshaker.java:155) ~[elasticsearch-7.16.2.jar:7.16.2]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at org.elasticsearch.transport.TransportHandshaker.lambda$sendHandshake$0(TransportHandshaker.java:52) ~[elasticsearch-7.16.2.jar:7.16.2]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:241) ~[elasticsearch-7.16.2.jar:7.16.2]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$0(ActionListener.java:277) ~[elasticsearch-7.16.2.jar:7.16.2]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at org.elasticsearch.core.CompletableContext.lambda$addListener$0(CompletableContext.java:28) ~[elasticsearch-core-7.16.2.jar:7.16.2]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at org.elasticsearch.core.CompletableContext.complete(CompletableContext.java:50) ~[elasticsearch-core-7.16.2.jar:7.16.2]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$addListener$0(Netty4TcpChannel.java:51) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:571) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:550) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:605) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:1182) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:773) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:749) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:620) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1352) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:622) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:606) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:472) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.handler.ssl.SslUtils.handleHandshakeFailure(SslUtils.java:440) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.handler.ssl.SslHandler$7.run(SslHandler.java:2118) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "at java.lang.Thread.run(Thread.java:833) [?:?]",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "Caused by: org.elasticsearch.transport.TransportException: handshake failed because connection reset",
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: "... 36 more"] }
Jan 24 17:29:54 es-master02.pic.es docker-es[2023]: {"type": "server", "timestamp": "2022-01-24T16:29:54,757Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "es-pic", "node.name": "es-master02", "message": "sending transport message [Request{internal:tcp/handshake}{850273}{false}{false}{true}] of size [58] on [Netty4TcpChannel{localAddress=/172.17.0.2:59408, remoteAddress=192.168.96.128/192.168.96.128:9300, profile=default}] took [10008ms] which is above the warn threshold of [5000ms] with success [false]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "YVg1_1FXS5qbVBB2VNztoA"  }
Jan 24 17:30:14 es-master02.pic.es docker-es[2023]: {"type": "server", "timestamp": "2022-01-24T16:30:14,257Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-master02", "message": "added {{es-data02_warm}{kkPmDGCsQ6aSBYzjbOkkuA}{3Epnz_ihRmGueQiJYii26w}{192.168.96.128}{192.168.96.128:9301}{d}}, term: 11, version: 220645, reason: ApplyCommitRequest{term=11, version=220645, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{vDmjVfsLSjq7Tlf-fRdvzQ}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "YVg1_1FXS5qbVBB2VNztoA"  }

The reason the nodes left is followers check retry count exceeded [timeouts=3, failures=0], i.e. the health checks timed out 3 times in a row, as if the node wasn't responding for ≥30s. I don't see any big pauses in the GC logs, so that doesn't seem a likely explanation, although you should check the logs on all nodes. One possibility is a network outage (or congestion). Another is some kind of temporary overload on one node or another, although I'd expect to see noise in the logs before a node-left event in that case. The master seems to be coordinating writes (i.e. sending indices:data/write/bulk[s] messages), which tends to be bad for stability.
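
For reference, the fault-detection defaults behind that behaviour (a node is removed after three consecutive 10s follower-check timeouts) are, to my knowledge, the following; shown for context only, not as a suggestion to raise them:

cluster.fault_detection.follower_check.interval: 1s
cluster.fault_detection.follower_check.timeout: 10s
cluster.fault_detection.follower_check.retry_count: 3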

Hmm, that's suspicious. Were you finding that this health check reported a lot of timeouts with the default threshold of 5s? If so, that's pretty bad: the health check only writes a few bytes to disk, which should take milliseconds at worst.

Could you share the output of GET _cluster/settings?flat_settings too please?

Hi David,

Thanks your for two responses.

A network issue is something I have been considering, but we can't detect any change that could have caused it, and the problem only appeared once we started moving shards. It may be a coincidence in time and something really did change, but we still have to look down this path.

For the monitor.fs.health.slow_path_logging_threshold option, I really added it out of desperation :slight_smile:
A line kept appearing saying that a 5000ms threshold had been exceeded, and I couldn't find which option it was, so I added this, but nothing changed. Same reason for discovery.seed_resolver.timeout: one of my theories is that the problem is network related, so maybe resolving the hostnames takes longer and fails, but I'm kind of lost.

Here is the output for the settings call

{
  "persistent" : { },
  "transient" : { }
}

Thanks!

The warnings about slow writes are there to help you, so I don't recommend just suppressing them. Nothing is actually failing; it's just telling you that your disks appear to be desperately slow, which will likely cause other problems.

If your network is performing poorly then you might not notice until the extra load of shard movements pushes it over the edge.

You could try reducing indices.recovery.max_bytes_per_sec, which defaults to a fairly conservative 40MB/s, but maybe your system can't even deal with that much recovery bandwidth.
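
(As a sketch, that would be a dynamic cluster settings update along these lines; 20mb is just an example value to try:)

PUT _cluster/settings
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "20mb"
  }
}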
