Warm nodes on spinning disks unstable, falling out of the cluster during recovery

Hi,

our warm nodes are falling out of the cluster while in recovery, and then recovery starts again from zero. The hot nodes are OK.

Env:
4 servers, each with 256GB RAM and 8 CPUs, running: 1 hot node (64GB, SSD, last 30 days of data, heavy input from Logstash at 5-10k/s) and 1 warm node (32GB heap, spinning disks, data older than 30 days). In total: 6k indices, 20k shards.

Warm nodes are falling out here and there. It is possible to bring the cluster to a yellow state with:

"cluster.routing.allocation.node_concurrent_incoming_recoveries" : "0",
"cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "0",
"cluster.routing.allocation.node_concurrent_recoveries" : "0",
"cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
"cluster.routing.allocation.node_initial_primaries_recoveries" : "40"

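(For reference, a minimal sketch of applying these through the cluster settings API; whether transient or persistent, and the host, are assumptions here:

curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": "0",
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": "0",
    "cluster.routing.allocation.node_concurrent_recoveries": "0",
    "cluster.routing.allocation.cluster_concurrent_rebalance": "0",
    "cluster.routing.allocation.node_initial_primaries_recoveries": "40"
  }
}'
)
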
but after raising concurrent recoveries to 8, or even 4, the warm nodes begin to fall out of the cluster every 2-3 hours, with heavy load on the spinning disks.

extra settings:

cluster.publish.info_timeout: 40s
cluster.publish.timeout: 120s
cluster.follower_lag.timeout: 300s
discovery.zen.fd.ping_timeout : 111s
discovery.zen.fd.ping_retries : 5
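
(To double-check what is actually in effect at runtime, the cluster get-settings API can dump everything, defaults included; the host below is an assumption:

curl -s 'http://localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty'
)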

status:

  "cluster_name" : "server-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 6775,
  "active_shards" : 9001,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 11281,
  "delayed_unassigned_shards" : 158,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 44.37925253919732

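(The numbers above are the output of the cluster health API; for anyone reproducing, roughly:

curl -s 'http://localhost:9200/_cluster/health?pretty'
)
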
Please advise on other timeout (or similar) settings, thanks.

Log from the master node:

[2020-09-27T11:01:14,432][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411159, reason: ApplyCommitRequest{term=64, version=411159, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:11,207][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411161, reason: ApplyCommitRequest{term=64, version=411161, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:15,982][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411162, reason: ApplyCommitRequest{term=64, version=411162, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:23,024][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411165, reason: ApplyCommitRequest{term=64, version=411165, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:26,112][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411166, reason: ApplyCommitRequest{term=64, version=411166, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:29,865][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411168, reason: ApplyCommitRequest{term=64, version=411168, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:32,071][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411169, reason: ApplyCommitRequest{term=64, version=411169, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:40,727][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411171, reason: ApplyCommitRequest{term=64, version=411171, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:41,762][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411172, reason: ApplyCommitRequest{term=64, version=411172, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:44,702][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411174, reason: ApplyCommitRequest{term=64, version=411174, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:48,272][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411175, reason: ApplyCommitRequest{term=64, version=411175, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:51,359][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411176, reason: ApplyCommitRequest{term=64, version=411176, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:54,992][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411177, reason: ApplyCommitRequest{term=64, version=411177, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:04:00,065][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411179, reason: ApplyCommitRequest{term=64, version=411179, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:04:07,309][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [11794ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6770] unchanged indices
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings  ] [serverra3_warm.sit.comp.state] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [10] to [20]
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings  ] [serverra3_warm.sit.comp.state] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [2] to [4]
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings  ] [serverra3_warm.sit.comp.state] updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [2] to [4]
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings  ] [serverra3_warm.sit.comp.state] updating [indices.recovery.max_bytes_per_sec] from [200mb] to [500mb]

Log from the warm node: https://pastebin.pl/view/5f4990c8

You are massively oversharded; I would start by shrinking or deleting some indices.
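
(As a sketch of what shrinking one index can look like; the index name, target name and single-shard choice below are purely illustrative, the node name is one of yours:

# 1. Block writes and move all copies of the index onto one node
curl -X PUT 'http://localhost:9200/logs-2020.08.01/_settings' -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.require._name": "serverra1_warm.sit.comp.state",
  "index.blocks.write": true
}'

# 2. Shrink it into a new index with a single primary shard
curl -X POST 'http://localhost:9200/logs-2020.08.01/_shrink/logs-2020.08.01-shrunk' -H 'Content-Type: application/json' -d '
{
  "settings": { "index.number_of_shards": 1, "index.number_of_replicas": 1 }
}'
)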

Hi Mark,

thanks for the answer. We'll do that, but how do we bring the cluster to green first?

With these extra settings (defaults ×3) from the last post:

transport.connect_timeout: 90s
transport.ping_schedule: 5s
cluster.fault_detection.follower_check.interval: 3s
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.retry_count: 9
cluster.fault_detection.leader_check.interval: 3s
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.retry_count: 9

there is still the same error, now after 9 failed checks:

[2020-09-28T00:36:13,048][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra1_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-28T00:36:43,197][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra1_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-28T00:36:53,489][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra1_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-28T00:37:18,521][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra1_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-28T00:37:51,866][INFO ][o.e.c.c.Coordinator      ] [serverra1_warm.sit.comp.state] master node [{serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}] failed, restarting discovery
org.elasticsearch.ElasticsearchException: node [{serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}] failed [9] consecutive checks
...
Caused by: org.elasticsearch.transport.RemoteTransportException: [serverra2.sit.comp.state][10.100.24.231:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{yfSbRUSqSiuc6gdykGyHRg}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}] has been removed from the cluster

...
[2020-09-28T00:38:17,424][INFO ][o.e.c.c.JoinHelper       ] [serverra1_warm.sit.comp.state] last failed join attempt was 4.6h ago, failed to join {serverla1.sit.comp.state}{DtuHk-M9TRi1axJC_BzVog}{HPr8cck9Tt6XvjX0QReKIA}{10.100.24.233}{10.100.24.233:9300}{dilmrt}{rack_id=rack_2, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true} with JoinRequest{sourceNode={serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{yfSbRUSqSiuc6gdykGyHRg}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, xpack.installed=true, data=warm, transform.node=true, ml.max_open_jobs=20}, minimumTerm=67, optionalJoin=Optional[Join{term=67, lastAcceptedTerm=64, lastAcceptedVersion=412684, sourceNode={serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{yfSbRUSqSiuc6gdykGyHRg}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, xpack.installed=true, data=warm, transform.node=true, ml.max_open_jobs=20}, targetNode={serverla1.sit.comp.state}{DtuHk-M9TRi1axJC_BzVog}{HPr8cck9Tt6XvjX0QReKIA}{10.100.24.233}{10.100.24.233:9300}{dilmrt}{rack_id=rack_2, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [serverla1.sit.comp.state][10.100.24.233:9300][internal:cluster/coordination/join] request_id [885] timed out after [59992ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1041) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:633) ~[elasticsearch-7.7.0.jar:7.7.0]
...
[2020-09-28T00:38:17,427][WARN ][o.e.c.c.ClusterFormationFailureHelper] [serverra1_warm.sit.comp.state] master not discovered yet: have discovered [{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{yfSbRUSqSiuc6gdykGyHRg}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, xpack.installed=true, data=warm, transform.node=true, ml.max_open_jobs=20}, {serverla1.sit.comp.state}{DtuHk-M9TRi1axJC_BzVog}{K8VqCjeyTM2InWrv1mb2iw}{10.100.24.233}{10.100.24.233:9300}{dilmrt}{rack_id=rack_2, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {logmanra1.sit.comp.state}{WJ1uKAbVTd-9BvtMZYk37g}{ymLuRdgiRnilAiPt7-jWrA}{10.100.24.230}{10.100.24.230:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{0S9K9G-BRG2F_u2__tlZoA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}]; discovery will continue using [10.100.24.230:9300, 10.100.24.231:9300, 10.100.24.232:9300] from hosts providers and [{serverla1.sit.comp.state}{DtuHk-M9TRi1axJC_BzVog}{K8VqCjeyTM2InWrv1mb2iw}{10.100.24.233}{10.100.24.233:9300}{dilmrt}{rack_id=rack_2, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra1.sit.comp.state}{WJ1uKAbVTd-9BvtMZYk37g}{ymLuRdgiRnilAiPt7-jWrA}{10.100.24.230}{10.100.24.230:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{0S9K9G-BRG2F_u2__tlZoA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}] from last-known cluster state; node term 68, last-accepted version 414155 in term 68
[2020-09-28T00:38:29,455][WARN ][o.e.c.c.ClusterFormationFailureHelper] [serverra1_warm.sit.comp.state] master not discovered yet: have discovered [{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{yfSbRUSqSiuc6gdykGyHRg}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, xpack.installed=true, data=warm, transform.node=true, ml.max_open_jobs=20}, {serverla1.sit.comp.state}{DtuHk-M9TRi1axJC_BzVog}{K8VqCjeyTM2InWrv1mb2iw}{10.100.24.233}{10.100.24.233:9300}{dilmrt}{rack_id=rack_2, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {logmanra1.sit.comp.state}{WJ1uKAbVTd-9BvtMZYk37g}{ymLuRdgiRnilAiPt7-jWrA}{10.100.24.230}{10.100.24.230:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{0S9K9G-BRG2F_u2__tlZoA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}]; discovery will continue using [10.100.24.230:9300, 10.100.24.231:9300, 10.100.24.232:9300] from hosts providers and [{serverla1.sit.comp.state}{DtuHk-M9TRi1axJC_BzVog}{K8VqCjeyTM2InWrv1mb2iw}{10.100.24.233}{10.100.24.233:9300}{dilmrt}{rack_id=rack_2, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra1.sit.comp.state}{WJ1uKAbVTd-9BvtMZYk37g}{ymLuRdgiRnilAiPt7-jWrA}{10.100.24.230}{10.100.24.230:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{0S9K9G-BRG2F_u2__tlZoA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}] from last-known cluster state; node term 68, last-accepted version 414155 in term 68

Once again, the cluster turns yellow with these settings:

"cluster.routing.allocation.node_concurrent_incoming_recoveries" : "0",
"cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "0",
"cluster.routing.allocation.node_concurrent_recoveries" : "0",
"cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
"cluster.routing.allocation.node_initial_primaries_recoveries" : "40",
"indices.recovery.max_bytes_per_sec" : "250mb"

and the warm nodes (only, not the hot ones) start to fall out after this:

"cluster.routing.allocation.node_concurrent_incoming_recoveries" : "4",
"cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "4",
"cluster.routing.allocation.node_concurrent_recoveries" : "8",
"cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
"cluster.routing.allocation.node_initial_primaries_recoveries" : "40",
"indices.recovery.max_bytes_per_sec" : "250mb"

It looks like you are pushing your nodes far harder than the defaults allow, and far harder than they can cope with:

cluster.routing.allocation.node_concurrent_incoming_recoveries: the default for this is 2 and increasing it is not recommended.

cluster.routing.allocation.node_concurrent_outgoing_recoveries: the default for this is 2 and increasing it is not recommended.

cluster.routing.allocation.node_concurrent_recoveries: the default for this is 2 and increasing it is not recommended.

cluster.routing.allocation.node_initial_primaries_recoveries: the default for this is 4 and increasing it is not recommended.

indices.recovery.max_bytes_per_sec: the default for this is 40mb; increasing it is OK if your nodes can cope with the increase, but the evidence indicates that 250mb is far too high for your warm nodes.

These settings are documented as expert settings with the following warning:

WARNING: If you adjust these settings then your cluster may not form correctly or may become unstable or intolerant of certain failures.

You should not be adjusting them.

Hi David,

thanks for answer.

We tried before with:

"cluster.routing.allocation.node_concurrent_incoming_recoveries" : "1",
"cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "1",
"cluster.routing.allocation.node_concurrent_recoveries" : "2",
"cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
"cluster.routing.allocation.node_initial_primaries_recoveries" : "10",
"indices.recovery.max_bytes_per_sec" : "250mb"

but with the same result. OK, indices.recovery.max_bytes_per_sec is now 80mb (our network is 10G).

There are a lot of unassigned shards with reason node_left (all nodes are online). I'll first try to manually allocate shards to the nodes where they already are, and then continue. Any other suggestion is welcome.
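
(Two requests that may help with that; the index and shard values are illustrative. Allocation explain shows why a given shard is unassigned, and a reroute with retry_failed retries shards that gave up after repeated node_left failures:

# Why is this shard unassigned?
curl -s -X GET 'http://localhost:9200/_cluster/allocation/explain?pretty' -H 'Content-Type: application/json' -d '
{ "index": "logs-2020.08.01", "shard": 0, "primary": true }'

# Retry allocations that hit the failure limit
curl -s -X POST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'
)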

The limiting factor is likely your disk bandwidth, not your network. Saturated disk bandwidth would explain the symptoms you're describing. Even 80mb (i.e. 640Mbps) may be too high for spinning disks.

(edit: 640Mbps not 640MBps :roll_eyes: )

Why did you not reduce node_initial_primaries_recoveries to the default of 4 (or lower)? It may not make much difference, but your cluster is showing severe signs of overload; you definitely should not be relaxing any of these limits.
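
(If the overrides were applied as transient settings, one way back to the defaults is to null them out rather than typing the default values back in; a sketch:

curl -X PUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_incoming_recoveries": null,
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries": null,
    "cluster.routing.allocation.node_concurrent_recoveries": null,
    "cluster.routing.allocation.node_initial_primaries_recoveries": null,
    "indices.recovery.max_bytes_per_sec": null
  }
}'
)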

OK, it's now 20mb (indices.recovery.max_bytes_per_sec).

The cluster was doing just fine with

"cluster.routing.allocation.node_initial_primaries_recoveries" : "10"

and even 20 or 30 (up to the yellow state); the problems start when enabling concurrent recoveries.

OK, and cluster.routing.allocation.node_initial_primaries_recoveries is now at the default of 4.

Hi,

an update. Still not green:

  "number_of_nodes" : 8,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 6118,
  "active_shards" : 6987,
  "relocating_shards" : 0,
  "initializing_shards" : 13,
  "unassigned_shards" : 6813,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 31,
  "number_of_in_flight_fetch" : 488,
  "task_max_waiting_in_queue_millis" : 12674,
  "active_shards_percent_as_number" : 50.58278433359879

The last warm node went down this morning (3h ago), so the setup is now as minimal as possible:

"cluster.routing.allocation.node_concurrent_incoming_recoveries" : "1",
"cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "1",
"cluster.routing.allocation.node_concurrent_recoveries" : "2",
"cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
"cluster.routing.allocation.node_initial_primaries_recoveries" : "4",
"indices.recovery.max_bytes_per_sec" : "20mb"

But every 15-20 minutes we see massive reads of 2 GB/s (no writes), lasting 3-4 minutes, across all the HDDs that belong to the warm node. Is this some internal job reading state from all the disks for the cluster state, and is there any workaround?

[2020-09-29T08:18:31,629][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [13266ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6862] unchanged indices
[2020-09-29T08:35:59,043][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra3_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-29T09:04:18,835][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [12493ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6864] unchanged indices
[2020-09-29T09:21:23,240][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [23118ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T09:22:02,740][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [29629ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [3] indices and skipped [6863] unchanged indices
[2020-09-29T09:22:14,778][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra3_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-29T09:22:22,896][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [20066ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [4] indices and skipped [6862] unchanged indices
[2020-09-29T09:23:02,220][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [14211ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [2] indices and skipped [6864] unchanged indices
[2020-09-29T09:29:55,293][INFO ][o.e.m.j.JvmGcMonitorService] [serverra3_warm.sit.comp.state] [gc][115326] overhead, spent [687ms] collecting in the last [1.4s]
[2020-09-29T09:32:18,047][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [10102ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [3] indices and skipped [6863] unchanged indices
[2020-09-29T09:32:33,372][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [15213ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [2] indices and skipped [6864] unchanged indices
[2020-09-29T09:33:03,862][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [10608ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T09:33:21,169][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [10435ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T09:33:51,257][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [23078ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T09:34:05,110][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [13821ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [2] indices and skipped [6864] unchanged indices
[2020-09-29T09:35:34,168][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [11695ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T09:35:44,631][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [10402ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T10:04:09,530][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra3_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-29T10:04:09,619][WARN ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] cluster state applier task [ApplyCommitRequest{term=68, version=440847, sourceNode={serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}] took [31.7s] which is above the warn threshold of [30s]: [running task [ApplyCommitRequest{term=68, version=440847, sourceNode={serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}]] took [0ms], [connecting to new nodes] took [0ms], [applying settings] took [0ms], [running applier [org.elasticsearch.indices.cluster.IndicesClusterStateService@3c206a30]] took [31717ms], [running applier [org.elasticsearch.script.ScriptService@7e6de2ab]] took [0ms], [running applier [org.elasticsearch.xpack.ilm.IndexLifecycleService@f297a52]] took [0ms], [running applier [org.elasticsearch.repositories.RepositoriesService@26d88ba5]] took [0ms], [running applier [org.elasticsearch.snapshots.RestoreService@72f9a78]] took [0ms], [running applier [org.elasticsearch.ingest.IngestService@ed275b8]] took [0ms], [running applier [org.elasticsearch.action.ingest.IngestActionForwarder@755a6441]] took [0ms], [running applier [org.elasticsearch.action.admin.cluster.repositories.cleanup.TransportCleanupRepositoryAction$$Lambda$3650/0x00007fb3b34fd0b0@6308e5cb]] took [0ms], [running applier [org.elasticsearch.tasks.TaskManager@203d9fc7]] took [0ms], [notifying listener [org.elasticsearch.cluster.InternalClusterInfoService@12659bb9]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@7d401c2e]] took [2ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@3622b917]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.authc.TokenService$$Lambda$2346/0x00007fb3f50ff858@68bd1334]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$$Lambda$2429/0x00007fb3f51ca058@3dea896f]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.support.WatcherIndexTemplateRegistry@1e2375c4]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherLifeCycleService@1e87f187]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherIndexingListener@565b6fe3]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlIndexTemplateRegistry@444b7943]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager@7da7b9da]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.datafeed.DatafeedManager$TaskRunner@18deae60]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService@2922f412]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlAssignmentNotifier@2c3b427]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlInitializationService@301071f1]] took [0ms], [notifying listener [org.elasticsearch.xpack.ilm.history.ILMHistoryTemplateRegistry@5dd0743d]] took [0ms], [notifying listener [org.elasticsearch.xpack.ilm.IndexLifecycleService@f297a52]] took [0ms], [notifying listener [org.elasticsearch.xpack.core.slm.history.SnapshotLifecycleTemplateRegistry@63e....
[2020-09-29T10:04:20,925][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra3_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-29T10:04:54,817][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [77794ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [3] indices and skipped [6865] unchanged indices
[2020-09-29T10:06:54,687][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [119861ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [5] indices and skipped [6863] unchanged indices
[2020-09-29T10:07:32,146][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra3_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-29T10:08:14,307][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [79728ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [3] indices and skipped [6865] unchanged indices
[2020-09-29T10:08:52,678][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [38247ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [5] indices and skipped [6863] unchanged indices
[2020-09-29T10:09:25,831][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [33281ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [7] indices and skipped [6861] unchanged indices
[2020-09-29T10:09:57,267][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [31290ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [6] indices and skipped [6862] unchanged indices
[2020-09-29T10:10:17,386][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [13196ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [4] indices and skipped [6864] unchanged indices

Puzzling that there's such a high read rate, and no, this isn't anything to do with cluster state updates, although it certainly seems to be interfering with them. Can you GET _nodes/serverra3_warm.sit.comp.state/hot_threads?threads=99999 to see why Elasticsearch is reading from disk so heavily?
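
(Concretely, something along these lines; the host and port are assumptions:

curl -s 'http://localhost:9200/_nodes/serverra3_warm.sit.comp.state/hot_threads?threads=99999'
)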

Hang on, is Elasticsearch's disk on /dev/dm-2? If so, and if this read traffic was caused by Elasticsearch, then it'd be coming through /dev/dm-2, but that device reports zero read traffic.

dm-2 is the SSD /dev/sda, i.e. the hot node. The HDDs for the warm node are sdb through sdi.
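
(The per-disk read rates can be watched with something like iostat from sysstat, for example:

iostat -xm 1    # extended stats in MB/s, refreshed every second; sdb..sdi are the warm-node disks
)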

I'm waiting for the next big read to run the GET hot threads.

One warm node dropped out just now, while the server was doing nothing. Both nodes (master and warm) are on the same server.

master log: https://pastebin.pl/view/d18a3ec1

warm node log: https://pastebin.pl/view/ebc36c2e

threads: https://pastebin.pl/view/e74dab9d

The log is now in DEBUG mode.

settings still:

"cluster.routing.allocation.node_concurrent_incoming_recoveries" : "1",
"cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "1",
"cluster.routing.allocation.node_concurrent_recoveries" : "2",
"cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
"cluster.routing.allocation.node_initial_primaries_recoveries" : "4",
"indices.recovery.max_bytes_per_sec" : "20mb"

Output from warm node ra2 during a big read (the same one from the post above): https://pastebin.pl/view/2e1ac614

one more: https://pastebin.pl/view/f9e2ff26

Looks like https://github.com/elastic/elasticsearch/issues/56933, so my first suggestion would be to reduce the size of the fetch_shard_store threadpool to 1 on your warm nodes.

Can this setting be set dynamically, without restarting the warm nodes? If so, how? Thanks.

No, you must set it in elasticsearch.yml and restart the nodes. You can do this in a rolling fashion, however, one node at a time.
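
(For reference, my understanding is that fetch_shard_store is a scaling thread pool, so the cap would go into elasticsearch.yml on the warm nodes roughly like this; please double-check the exact key against the thread pool docs for your version before relying on it:

thread_pool.fetch_shard_store.max: 1
)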

Many thanks for the fast responses.

I asked because it takes 2-3 hours after a warm-node restart to get back to the yellow state. We have decided to clear all replicas (for the warm nodes only) and recreate them after reaching green.
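
(A sketch of what clearing and later restoring replicas looks like; the index pattern is illustrative and assumes the warm indices can be matched by name:

# Drop replicas on the matched indices
curl -X PUT 'http://localhost:9200/logs-2020.07.*/_settings' -H 'Content-Type: application/json' -d '
{ "index.number_of_replicas": 0 }'

# Later, once the cluster is green, add them back
curl -X PUT 'http://localhost:9200/logs-2020.07.*/_settings' -H 'Content-Type: application/json' -d '
{ "index.number_of_replicas": 1 }'
)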

Did you look at the case where a warm node fell out without any disk activity (the post from 4h ago)?

Such is life when your cluster has far too many shards.

That sounds like a bad idea to me: green health with no replicas is actually less healthy than yellow health with replicas, since at least in the latter case some of the replicas are assigned. Moreover, if it is the issue I linked above, then you'll just hit the same situation when re-enabling replicas.
