Hello,
I have a problem that I can't seem to fix or understand.
I have an Elasticsearch cluster with three master nodes and six data servers running 12 data node instances.
Master nodes: virtual machines with 2 CPUs and 4 GB of memory.
Data nodes: physical machines with 28 CPUs and 128 GB of RAM.
Each data server runs two Docker containers: one for a hot instance (SSD storage) and the other for a warm instance (HDD storage).
Elasticsearch version is 7.16.2 on all instances.
Problem
A few days ago, because of a conflict between my ILM policy and the initial index requirements, data wasn't moved from hot to warm the way it was supposed to be, and the hot instances ran out of space. So I started manually moving indexes from hot to warm to free up space and relocate everything so the cluster would leave RED status.
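In case it helps, this is roughly how I moved each index (the index name here is just an example; the data attribute matches the node.attr.data setting shown below):

PUT dcache_kafka-2021.11.10/_settings
{
  "index.routing.allocation.require.data": "warm"
}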
Here's where my problems started. Nodes began leaving the cluster at random (at least, randomly to my eyes), with all their shards trying to relocate, provoking a cascade of failures. I thought it could be caused by hundreds of shards trying to move around the cluster at once, which is not what would normally happen when data movement is driven by a properly managed ILM policy.
I stopped shards from being relocated, as shown below, and then started relocating them manually, little by little, but nodes would still randomly leave the cluster when relocating some indexes.
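To stop the automatic relocations I used the cluster-level routing setting, with something like this (from memory, so take the exact body as approximate):

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}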
One thing to clarify: the cluster works "properly" as long as nodes are not leaving it; I can query through Kibana, and logs keep being ingested.
Settings for master nodes
cluster.name: es-pic
network.host: 0.0.0.0
network.publish_host: 192.168.96.170
network.bind_host: 0.0.0.0
node.name: es-master03
node.roles: [ master ]
cluster.initial_master_nodes: es-master01, es-master02, es-master03
discovery.seed_hosts: es-master01.pic.es:9300,es-master02.pic.es:9300,es-master03.pic.es:9300
path.data: /usr/share/elasticsearch/data
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/certs/es_pic_es.key
xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/certs/es_pic_es.crt
xpack.security.transport.ssl.certificate_authorities: /usr/share/elasticsearch/config/certs/SectigoCA.crt
discovery.seed_resolver.timeout: 30s
discovery.seed_resolver.max_concurrent_resolvers: 20
monitor.fs.health.slow_path_logging_threshold: 30s
Settings for hot data nodes
cluster.name: es-pic
network.host: 0.0.0.0
network.publish_host: 192.168.96.127
network.bind_host: 0.0.0.0
node.name: es-data01_hot
node.roles: [ data ]
node.attr.data: hot
http.port: 9200
transport.port: 9300
cluster.initial_master_nodes: es-master01, es-master02, es-master03
discovery.seed_hosts: es-master01.pic.es:9300,es-master02.pic.es:9300,es-master03.pic.es:9300
path.data: /usr/share/elasticsearch/data
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/certs/es_pic_es.key
xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/certs/es_pic_es.crt
xpack.security.transport.ssl.certificate_authorities: /usr/share/elasticsearch/config/certs/SectigoCA.crt
discovery.seed_resolver.timeout: 30s
discovery.seed_resolver.max_concurrent_resolvers: 20
monitor.fs.health.slow_path_logging_threshold: 30s
Settings for warm data nodes
cluster.name: es-pic
network.host: 0.0.0.0
network.publish_host: 192.168.96.127
network.bind_host: 0.0.0.0
node.name: es-data01_warm
node.roles: [ data ]
node.attr.data: warm
http.port: 9201
transport.port: 9301
cluster.initial_master_nodes: es-master01, es-master02, es-master03
discovery.seed_hosts: es-master01.pic.es:9300,es-master02.pic.es:9300,es-master03.pic.es:9300
path.data: /usr/share/elasticsearch/data1,/usr/share/elasticsearch/data2
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/certs/es_pic_es.key
xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/certs/es_pic_es.crt
xpack.security.transport.ssl.certificate_authorities: /usr/share/elasticsearch/config/certs/SectigoCA.crt
discovery.seed_resolver.timeout: 30s
discovery.seed_resolver.max_concurrent_resolvers: 20
monitor.fs.health.slow_path_logging_threshold: 30s
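For context, the warm phase of my ILM policies moves data using that data attribute. It looks roughly like this (the policy name and min_age here are illustrative, not my exact values):

PUT _ilm/policy/example-policy
{
  "policy": {
    "phases": {
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "require": {
              "data": "warm"
            }
          }
        }
      }
    }
  }
}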
ERRORS
On a hot data node (this message appeared several times around the same time):
Jan 19 15:27:10 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:27:09,877Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "sending transport message [Request{indices:admin/seq_no/global_checkpoint_sync[r]}{171888256}{false}{false}{false}] of size [368] on [Netty4TcpChannel{localAddress=/172.17.0.2:38576, remoteAddress=192.168.96.129/192.168.96.129:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms] with success [true]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
A few seconds later, from another node:
Jan 19 15:29:52 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:29:52,965Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": "removed {{es-data05_hot}{Q0IH51fZSM2SVGzKUjP3Mw}{dRPTMsj5Q-mAYGwk0S0FOQ}{192.168.96.131}{192.168.96.131:9300}{d}}, term: 6, version: 196505, reason: ApplyCommitRequest{term=6, version=196505, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag" }
Jan 19 15:29:52 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:29:52,981Z", "level": "INFO", "component": "o.e.i.s.IndexShard", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": " [dcache_kafka-2021.11.10][0] primary-replica resync completed with 0 operations", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag" }
Jan 19 15:29:52 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:29:52,982Z", "level": "INFO", "component": "o.e.i.s.IndexShard", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": " [elastiflow-flow-codex-1.2-2021.12.31][0] primary-replica resync completed with 0 operations", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag" }
Jan 19 15:29:52 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:29:52,985Z", "level": "INFO", "component": "o.e.i.s.IndexShard", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": " [.ds-.logs-deprecation.elasticsearch-default-2022.01.19-000003][0] primary-replica resync completed with 0 operations", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag" }
Jan 19 15:30:07 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:30:07,058Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": "sending transport message [Request{indices:data/write/bulk[s][r]}{100478886}{false}{false}{false}] of size [2904805] on [Netty4TcpChannel{localAddress=/172.17.0.2:33597, remoteAddress=192.168.96.128/192.168.96.128:9300, profile=default}] took [11006ms] which is above the warn threshold of [5000ms] with success [true]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag" }
Jan 19 15:30:13 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:30:13,163Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": "added {{es-data05_hot}{Q0IH51fZSM2SVGzKUjP3Mw}{dRPTMsj5Q-mAYGwk0S0FOQ}{192.168.96.131}{192.168.96.131:9300}{d}}, term: 6, version: 196511, reason: ApplyCommitRequest{term=6, version=196511, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag" }
Jan 19 15:30:16 es-data04.pic.es docker-es_hot[27159]: {"type": "server", "timestamp": "2022-01-19T14:30:16,958Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "es-pic", "node.name": "es-data04_hot", "message": "sending transport message [Request{indices:data/write/bulk[s][r]}{100478672}{false}{false}{false}] of size [2835870] on [Netty4TcpChannel{localAddress=/172.17.0.2:33598, remoteAddress=192.168.96.128/192.168.96.128:9300, profile=default}] took [51428ms] which is above the warn threshold of [5000ms] with success [true]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "mDnM3wmBQB2AxSj00waaag" }
And from the node that was removed:
Jan 19 15:29:52 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:29:52,978Z", "level": "INFO", "component": "o.e.c.c.Coordinator", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "[3] consecutive checks of the master node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}] were unsuccessful ([0] rejected, [3] timed out), restarting discovery; more details may be available in the master node logs [last unsuccessful check: [es-master03][192.168.96.170:9300][internal:coordination/fault_detection/leader_check] request_id [171913802] timed out after [10005ms]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:29:52 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:29:52,980Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "master node changed {previous [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}], current []}, term: 6, version: 196504, reason: becoming candidate: onLeaderFailure", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:29:56 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:29:56,016Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.169:9300], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:29:56 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master02][192.168.96.169:9300][internal:discovery/request_peers] request_id [171914018] timed out after [3002ms]"] }
Jan 19 15:29:57 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:29:57,980Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.170:9300], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:29:57 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master03][192.168.96.170:9300][internal:discovery/request_peers] request_id [171914167] timed out after [3001ms]"] }
Jan 19 15:30:00 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:00,981Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.169:9300], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:00 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master02][192.168.96.169:9300][internal:discovery/request_peers] request_id [171914191] timed out after [3002ms]"] }
Jan 19 15:30:02 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:02,980Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "master not discovered yet: have discovered [{es-data05_hot}{Q0IH51fZSM2SVGzKUjP3Mw}{dRPTMsj5Q-mAYGwk0S0FOQ}{192.168.96.131}{192.168.96.131:9300}{d}, {es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}, {es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}, {es-master01}{wLE3jl04QqC5IJIvVpkjjg}{f9MoIlfHRZWG8GKidub9Lg}{192.168.96.168}{192.168.96.168:9300}{m}]; discovery will continue using [192.168.96.168:9300, 192.168.96.169:9300, 192.168.96.170:9300] from hosts providers and [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}, {es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}, {es-master01}{wLE3jl04QqC5IJIvVpkjjg}{f9MoIlfHRZWG8GKidub9Lg}{192.168.96.168}{192.168.96.168:9300}{m}] from last-known cluster state; node term 6, last-accepted version 196504 in term 6", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:04 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:04,982Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.170:9300], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:04 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master03][192.168.96.170:9300][internal:discovery/request_peers] request_id [171914247] timed out after [3002ms]"] }
Jan 19 15:30:05 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:05,982Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.169:9300], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:05 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master02][192.168.96.169:9300][internal:discovery/request_peers] request_id [171914262] timed out after [3002ms]"] }
Jan 19 15:30:07 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:07,733Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [12.8s/12806ms] ago, timed out [9.8s/9805ms] ago, action [internal:discovery/request_peers], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], id [171914167]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:07 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:07,733Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [5.8s/5803ms] ago, timed out [2.8s/2801ms] ago, action [internal:discovery/request_peers], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], id [171914247]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:08 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:08,983Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.169:9300], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:08 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master02][192.168.96.169:9300][internal:discovery/request_peers] request_id [171914310] timed out after [3001ms]"] }
Jan 19 15:30:11 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:11,984Z", "level": "WARN", "component": "o.e.d.PeerFinder", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "address [192.168.96.169:9300], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], requesting [false] peers request failed", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:11 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.ReceiveTimeoutTransportException: [es-master02][192.168.96.169:9300][internal:discovery/request_peers] request_id [171914369] timed out after [3002ms]"] }
Jan 19 15:30:12 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:12,589Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [51.6s/51628ms] ago, timed out [41.6s/41623ms] ago, action [internal:coordination/fault_detection/leader_check], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], id [171912397]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:12 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:12,593Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [40.6s/40622ms] ago, timed out [30.6s/30617ms] ago, action [internal:coordination/fault_detection/leader_check], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], id [171913456]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:12 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:12,594Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [29.6s/29616ms] ago, timed out [19.6s/19611ms] ago, action [internal:coordination/fault_detection/leader_check], node [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}], id [171913802]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:12 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:12,981Z", "level": "WARN", "component": "o.e.c.c.ClusterFormationFailureHelper", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "master not discovered yet: have discovered [{es-data05_hot}{Q0IH51fZSM2SVGzKUjP3Mw}{dRPTMsj5Q-mAYGwk0S0FOQ}{192.168.96.131}{192.168.96.131:9300}{d}, {es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}, {es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}, {es-master01}{wLE3jl04QqC5IJIvVpkjjg}{f9MoIlfHRZWG8GKidub9Lg}{192.168.96.168}{192.168.96.168:9300}{m}]; discovery will continue using [192.168.96.168:9300, 192.168.96.169:9300, 192.168.96.170:9300] from hosts providers and [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}, {es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}, {es-master01}{wLE3jl04QqC5IJIvVpkjjg}{f9MoIlfHRZWG8GKidub9Lg}{192.168.96.168}{192.168.96.168:9300}{m}] from last-known cluster state; node term 6, last-accepted version 196504 in term 6", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:13 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:13,310Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "master node changed {previous [], current [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}]}, term: 6, version: 196511, reason: ApplyCommitRequest{term=6, version=196511, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:16,943Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "sending transport message [Request{indices:data/write/bulk[s][r]}{171912798}{false}{false}{false}] of size [3008543] on [Netty4TcpChannel{localAddress=/172.17.0.2:42548, remoteAddress=192.168.96.132/192.168.96.132:9300, profile=default}] took [51427ms] which is above the warn threshold of [5000ms] with success [true]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:16,951Z", "level": "WARN", "component": "o.e.a.b.TransportShardBulkAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "[[elastiflow-flow-codex-1.2-2022.01.19][1]] failed to perform indices:data/write/bulk[s] on replica [elastiflow-flow-codex-1.2-2022.01.19][1], node[gYHPagfBSHGOdviFJgg71Q], [R], s[STARTED], a[id=xe2rFelfTO2WH7ZMjEMedQ]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-data06_hot][172.17.0.2:9300][indices:data/write/bulk[s][r]]",
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: "Caused by: java.lang.IllegalStateException: [elastiflow-flow-codex-1.2-2022.01.19][1] operation primary term [5] is too old (current [6])",
############################
REST OF LONG JAVA STACKTRACE
############################
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: "at java.lang.Thread.run(Thread.java:833) [?:?]"] }
Jan 19 15:30:16 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:16,957Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "no master known for action [internal:cluster/shard/failure] for shard entry [shard id [[elastiflow-flow-codex-1.2-2022.01.19][1]], allocation id [xe2rFelfTO2WH7ZMjEMedQ], primary term [5], message [failed to perform indices:data/write/bulk[s] on replica [elastiflow-flow-codex-1.2-2022.01.19][1], node[gYHPagfBSHGOdviFJgg71Q], [R], s[STARTED], a[id=xe2rFelfTO2WH7ZMjEMedQ]], failure [RemoteTransportException[[es-data06_hot][172.17.0.2:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[elastiflow-flow-codex-1.2-2022.01.19][1] operation primary term [5] is too old (current [6])]; ], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:17 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:17,364Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "no master known for action [internal:cluster/shard/failure] for shard entry [shard id [[dcache_kafka-2022.01.19][0]], allocation id [HD3AFtxXTlGH_BQpRNst6Q], primary term [9], message [failed to perform indices:data/write/bulk[s] on replica [dcache_kafka-2022.01.19][0], node[zdZCJSKSRDaBbVijIHkIAA], [R], s[STARTED], a[id=HD3AFtxXTlGH_BQpRNst6Q]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:18 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:18,122Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "no master known for action [internal:cluster/shard/failure] for shard entry [shard id [[syslog-2022.01.19][0]], allocation id [yC5NAAj_RBKqICGwsjseRw], primary term [9], message [failed to perform indices:data/write/bulk[s] on replica [syslog-2022.01.19][0], node[zdZCJSKSRDaBbVijIHkIAA], [R], s[STARTED], a[id=yC5NAAj_RBKqICGwsjseRw]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:18 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:18,695Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [12.6s/12606ms] ago, timed out [9.6s/9605ms] ago, action [internal:discovery/request_peers], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], id [171914310]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:30:18 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:30:18,998Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "Received response for a request that has timed out, sent [26s/26015ms] ago, timed out [23s/23013ms] ago, action [internal:discovery/request_peers], node [{es-master02}{YVg1_1FXS5qbVBB2VNztoA}{cFPJq2anQWCh68Yh6kOZTg}{192.168.96.169}{192.168.96.169:9300}{m}{xpack.installed=true, transform.node=false}], id [171914018]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" }
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:31:09,952Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "unexpected failure while sending request [internal:cluster/shard/failure] to [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}] for shard entry [shard id [[dcache_kafka-2022.01.19][0]], allocation id [HD3AFtxXTlGH_BQpRNst6Q], primary term [9], message [failed to perform indices:data/write/bulk[s] on replica [dcache_kafka-2022.01.19][0], node[zdZCJSKSRDaBbVijIHkIAA], [R], s[STARTED], a[id=HD3AFtxXTlGH_BQpRNst6Q]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-master03][172.17.0.2:9300][internal:cluster/shard/failure]",
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [9] did not match current primary term [10]",
############################
REST OF LONG JAVA STACKTRACE
############################
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "at java.lang.Thread.run(Thread.java:833) [?:?]"] }
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:31:09,955Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "unexpected failure while sending request [internal:cluster/shard/failure] to [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}] for shard entry [shard id [[elastiflow-flow-codex-1.2-2022.01.19][1]], allocation id [xe2rFelfTO2WH7ZMjEMedQ], primary term [5], message [failed to perform indices:data/write/bulk[s] on replica [elastiflow-flow-codex-1.2-2022.01.19][1], node[gYHPagfBSHGOdviFJgg71Q], [R], s[STARTED], a[id=xe2rFelfTO2WH7ZMjEMedQ]], failure [RemoteTransportException[[es-data06_hot][172.17.0.2:9300][indices:data/write/bulk[s][r]]]; nested: IllegalStateException[[elastiflow-flow-codex-1.2-2022.01.19][1] operation primary term [5] is too old (current [6])]; ], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-master03][172.17.0.2:9300][internal:cluster/shard/failure]",
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [5] did not match current primary term [6]",
############################
REST OF LONG JAVA STACKTRACE
############################
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "at java.lang.Thread.run(Thread.java:833) [?:?]"] }
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: {"type": "server", "timestamp": "2022-01-19T14:31:09,964Z", "level": "WARN", "component": "o.e.c.a.s.ShardStateAction", "cluster.name": "es-pic", "node.name": "es-data05_hot", "message": "unexpected failure while sending request [internal:cluster/shard/failure] to [{es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}] for shard entry [shard id [[syslog-2022.01.19][0]], allocation id [yC5NAAj_RBKqICGwsjseRw], primary term [9], message [failed to perform indices:data/write/bulk[s] on replica [syslog-2022.01.19][0], node[zdZCJSKSRDaBbVijIHkIAA], [R], s[STARTED], a[id=yC5NAAj_RBKqICGwsjseRw]], failure [IndexShardClosedException[CurrentState[CLOSED] Primary closed.]], markAsStale [true]]", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "Q0IH51fZSM2SVGzKUjP3Mw" ,
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-master03][172.17.0.2:9300][internal:cluster/shard/failure]",
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "Caused by: org.elasticsearch.cluster.action.shard.ShardStateAction$NoLongerPrimaryShardException: primary term [9] did not match current primary term [10]",
############################
REST OF LONG JAVA STACKTRACE
############################
Jan 19 15:31:09 es-data05.pic.es docker-es_hot[20983]: "at java.lang.Thread.run(Thread.java:833) [?:?]"] }
And on other nodes (master and data), a few seconds before one of the errors above, the removed node was added back to the cluster:
Jan 19 15:30:13 es-data03.pic.es docker-es_warm[8304]: {"type": "server", "timestamp": "2022-01-19T14:30:13,163Z", "level": "INFO", "component": "o.e.c.s.ClusterApplierService", "cluster.name": "es-pic", "node.name": "es-data03_warm", "message": "added {{es-data05_hot}{Q0IH51fZSM2SVGzKUjP3Mw}{dRPTMsj5Q-mAYGwk0S0FOQ}{192.168.96.131}{192.168.96.131:9300}{d}}, term: 6, version: 196511, reason: ApplyCommitRequest{term=6, version=196511, sourceNode={es-master03}{ga-b31a7QKasdkq513287Q}{Q8pPKsOMT3STBzECGNehbA}{192.168.96.170}{192.168.96.170:9300}{m}{xpack.installed=true, transform.node=false}}", "cluster.uuid": "6pn9EQ4WR1qKcZ4b0mgAsw", "node.id": "78dqOfXWR0uZHOc1_st2dg" }
After that, operations resumed and eventually the indexes were relocated, but at any point a node could leave the cluster again and the whole cycle would start over. I don't think this is how it is supposed to work; having the cluster constantly cycle from yellow to red and (eventually) back to green, never knowing when it will turn red or yellow again, doesn't seem workable for the small team we are.
Honestly, this had been working without a failure until the first storage problems appeared. I'm not sure whether something else I'm not aware of could be affecting it, but this is as much information as I have been able to gather.
As additional information, I increased the delay before the cluster starts reassigning the shards of a node that has left, so the cluster doesn't go completely crazy, since I know the node will be back within 10 minutes. I guess this is not good practice either, though.
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "10m"
  }
}
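For reference, a request like the following applies that setting to all existing indexes (shown as an illustration; the _all target may not be exactly what I ran):

PUT _all/_settings
{
  "index.unassigned.node_left.delayed_timeout": "10m"
}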
So, I don't know if I have something configured wrong. If anybody has had something similar happen to their cluster in the past, or if anything here rings a bell, any help would be hugely appreciated.
Thanks,
Jordi