Warm nodes on spinning disks not stable, falling out of cluster during recovery

otoce · September 27, 2020, 11:34am

Hi,

our warms nodes are falling out of cluster while in recovery, and then start recovery from zero. Hot nodes are ok.

Env:
4 servers, on each (256GB ram, 8 cpu) : 1 hot node, 64GB, ssd, 30 days of data, heavy input from logstash 5-10k/s. 1 warm node, 32GB heap, spinning disks, data after 30 days. 6k indices, 20k shards.

Warm nodes are falling out here and there. It is possible to bring cluster to yellow state with:

"cluster.routing.allocation.node_concurrent_incoming_recoveries" : "0",
"cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "0",
"cluster.routing.allocation.node_concurrent_recoveries" : "0",
"cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
"cluster.routing.allocation.node_initial_primaries_recoveries" : "40"

but after enabling concurrent_recoveries to 8 or even 4 warm nodes begin to fall out of the cluster every 2-3h, heavy load on spinning disks.

extra settings:

cluster.publish.info_timeout: 40s
cluster.publish.timeout: 120s
cluster.follower_lag.timeout: 300s
discovery.zen.fd.ping_timeout : 111s
discovery.zen.fd.ping_retries : 5

status:

  "cluster_name" : "server-cluster",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 6775,
  "active_shards" : 9001,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 11281,
  "delayed_unassigned_shards" : 158,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 44.37925253919732

Please advice with other timeout or ... settings, thanks

log from master node:

[2020-09-27T11:01:14,432][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411159, reason: ApplyCommitRequest{term=64, version=411159, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:11,207][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411161, reason: ApplyCommitRequest{term=64, version=411161, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:15,982][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411162, reason: ApplyCommitRequest{term=64, version=411162, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:23,024][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411165, reason: ApplyCommitRequest{term=64, version=411165, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:26,112][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411166, reason: ApplyCommitRequest{term=64, version=411166, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:29,865][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411168, reason: ApplyCommitRequest{term=64, version=411168, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:32,071][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411169, reason: ApplyCommitRequest{term=64, version=411169, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:40,727][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411171, reason: ApplyCommitRequest{term=64, version=411171, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:41,762][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411172, reason: ApplyCommitRequest{term=64, version=411172, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:44,702][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411174, reason: ApplyCommitRequest{term=64, version=411174, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:48,272][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411175, reason: ApplyCommitRequest{term=64, version=411175, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:51,359][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411176, reason: ApplyCommitRequest{term=64, version=411176, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:54,992][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411177, reason: ApplyCommitRequest{term=64, version=411177, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:04:00,065][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411179, reason: ApplyCommitRequest{term=64, version=411179, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:04:07,309][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [11794ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6770] unchanged indices
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings  ] [serverra3_warm.sit.comp.state] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [10] to [20]
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings  ] [serverra3_warm.sit.comp.state] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [2] to [4]
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings  ] [serverra3_warm.sit.comp.state] updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [2] to [4]
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings  ] [serverra3_warm.sit.comp.state] updating [indices.recovery.max_bytes_per_sec] from [200mb] to [500mb]

log from warm node https://pastebin.pl/view/5f4990c8

warkolm · September 28, 2020, 3:33am

You are massively over sharded, I would start by shrinking or deleting some indices.

otoce · September 28, 2020, 6:25am

Hi Mark,

thanks for answer. We'll do that, but how to bring cluster to green first

with this extra settings (default*3) from last post:

transport.connect_timeout: 90s
transport.ping_schedule: 5s
cluster.fault_detection.follower_check.interval: 3s
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.retry_count: 9
cluster.fault_detection.leader_check.interval: 3s
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.retry_count: 9

there is still the same error with 9 checks:

[2020-09-28T00:36:13,048][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra1_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-28T00:36:43,197][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra1_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-28T00:36:53,489][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra1_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-28T00:37:18,521][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra1_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-28T00:37:51,866][INFO ][o.e.c.c.Coordinator      ] [serverra1_warm.sit.comp.state] master node [{serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6
Onxdg}{1
0.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}] failed
, restarting discovery
org.elasticsearch.ElasticsearchException: node [{serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id
=rac
k_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}] failed [9] consecutive checks
...
Caused by: org.elasticsearch.transport.RemoteTransportException: [serverra2.sit.comp.state][10.100.24.231:9300][internal:coordination/fault_detection/leader_check]
Caused by: org.elasticsearch.cluster.coordination.CoordinationStateRejectedException: rejecting leader check since [{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{y
fSbR
USqSiuc6gdykGyHRg}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform
.node=true}] has been removed from the cluster

...
[2020-09-28T00:38:17,424][INFO ][o.e.c.c.JoinHelper       ] [serverra1_warm.sit.comp.state] last failed join attempt was 4.6h ago, failed to join {serverla1.sit.comp.state}{
DtuHk-M9
TRi1axJC_BzVog}{HPr8cck9Tt6XvjX0QReKIA}{10.100.24.233}{10.100.24.233:9300}{dilmrt}{rack_id=rack_2, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true,
 data=hot, transform.node=true} with JoinRequest{sourceNode={serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{yfSbRUSqSiuc6gdykGyHRg}{10.100.24.230}{10.100.24.230:9301
}{dl
rt}{rack_id=rack_one, ml.machine_memory=269645852672, xpack.installed=true, data=warm, transform.node=true, ml.max_open_jobs=20}, minimumTerm=67, optionalJoin=Optional[Join{
term=67, lastAcceptedTerm=64, lastAcceptedVersion=412684, sourceNode={serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{yfSbRUSqSiuc6gdykGyHRg}{10.100.24.230}{10.100.24
.230
:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, xpack.installed=true, data=warm, transform.node=true, ml.max_open_jobs=20}, targetNode={serverla1.sit.comp.sta
te}{
DtuHk-M9TRi1axJC_BzVog}{HPr8cck9Tt6XvjX0QReKIA}{10.100.24.233}{10.100.24.233:9300}{dilmrt}{rack_id=rack_2, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.install
ed=true, data=hot, transform.node=true}}]}
org.elasticsearch.transport.ReceiveTimeoutTransportException: [serverla1.sit.comp.state][10.100.24.233:9300][internal:cluster/coordination/join] request_id [885] timed out a
fter
 [59992ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:1041) ~[elasticsearch-7.7.0.jar:7.7.0]
        at org.elasticsearch.common.util.concurrent.TstateeadContext$ContextPreservingRunnable.run(TstateeadContext.java:633) ~[elasticsearch-7.7.0.jar:7.7.0]
...
[2020-09-28T00:38:17,427][WARN ][o.e.c.c.ClusterFormationFailureHelper] [serverra1_warm.sit.comp.state] master not discovered yet: have discovered [{serverra1_warm.sit.comp.
state}{K
yQI0BMySRKE4yoeitADCQ}{yfSbRUSqSiuc6gdykGyHRg}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, xpack.installed=true, data=warm, tr
ansform.node=true, ml.max_open_jobs=20}, {serverla1.sit.comp.state}{DtuHk-M9TRi1axJC_BzVog}{K8VqCjeyTM2InWrv1mb2iw}{10.100.24.233}{10.100.24.233:9300}{dilmrt}{rack_id=rack_2
, ml
.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6
Onxd
g}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {l
ogmanra1.sit.comp.state}{WJ1uKAbVTd-9BvtMZYk37g}{ymLuRdgiRnilAiPt7-jWrA}{10.100.24.230}{10.100.24.230:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_
open
_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{0S9K9G-BRG2F_u2__tlZoA}{10.100.24.232}{10.100.24.232:9300}
{dil
mrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}]; discovery will continue using [10.100.24.2
30:9300, 10.100.24.231:9300, 10.100.24.232:9300] from hosts providers and [{serverla1.sit.comp.state}{DtuHk-M9TRi1axJC_BzVog}{K8VqCjeyTM2InWrv1mb2iw}{10.100.24.233}{10.100.2
4.23
3:9300}{dilmrt}{rack_id=rack_2, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra2.sit.comp.state}{wHODtVp
4QJm
jUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true,
data=hot, transform.node=true}, {serverra1.sit.comp.state}{WJ1uKAbVTd-9BvtMZYk37g}{ymLuRdgiRnilAiPt7-jWrA}{10.100.24.230}{10.100.24.230:9300}{dilmrt}{rack_id=rack_one, ml.ma
chin
e_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{0S9K9G-BRG2F_u2__tlZoA}{
10.1
00.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}] from last
-known cluster state; node term 68, last-accepted version 414155 in term 68
[2020-09-28T00:38:29,455][WARN ][o.e.c.c.ClusterFormationFailureHelper] [serverra1_warm.sit.comp.state] master not discovered yet: have discovered [{serverra1_warm.sit.comp.
state}{K
yQI0BMySRKE4yoeitADCQ}{yfSbRUSqSiuc6gdykGyHRg}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, xpack.installed=true, data=warm, tr
ansform.node=true, ml.max_open_jobs=20}, {serverla1.sit.comp.state}{DtuHk-M9TRi1axJC_BzVog}{K8VqCjeyTM2InWrv1mb2iw}{10.100.24.233}{10.100.24.233:9300}{dilmrt}{rack_id=rack_2
, ml
.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6
Onxd
g}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {l
ogmanra1.sit.comp.state}{WJ1uKAbVTd-9BvtMZYk37g}{ymLuRdgiRnilAiPt7-jWrA}{10.100.24.230}{10.100.24.230:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_
open
_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{0S9K9G-BRG2F_u2__tlZoA}{10.100.24.232}{10.100.24.232:9300}
{dil
mrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}]; discovery will continue using [10.100.24.2
30:9300, 10.100.24.231:9300, 10.100.24.232:9300] from hosts providers and [{serverla1.sit.comp.state}{DtuHk-M9TRi1axJC_BzVog}{K8VqCjeyTM2InWrv1mb2iw}{10.100.24.233}{10.100.2
4.23
3:9300}{dilmrt}{rack_id=rack_2, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra2.sit.comp.state}{wHODtVp
4QJm
jUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true,
data=hot, transform.node=true}, {serverra1.sit.comp.state}{WJ1uKAbVTd-9BvtMZYk37g}{ymLuRdgiRnilAiPt7-jWrA}{10.100.24.230}{10.100.24.230:9300}{dilmrt}{rack_id=rack_one, ml.ma
chin
e_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}, {serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{0S9K9G-BRG2F_u2__tlZoA}{
10.1
00.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}] from last
-known cluster state; node term 68, last-accepted version 414155 in term 68

once again, cluster turns yellow with this settings:

"cluster.routing.allocation.node_concurrent_incoming_recoveries" : "0",
"cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "0",
"cluster.routing.allocation.node_concurrent_recoveries" : "0",
"cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
"cluster.routing.allocation.node_initial_primaries_recoveries" : "40",
"indices.recovery.max_bytes_per_sec" : "250mb"

and warm nodes (only, not hot one) starts to fall out after this:

   "cluster.routing.allocation.node_concurrent_incoming_recoveries" : "4",
    "cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "4",
    "cluster.routing.allocation.node_concurrent_recoveries" : "8",
    "cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
    "cluster.routing.allocation.node_initial_primaries_recoveries" : "40",
    "indices.recovery.max_bytes_per_sec" : "250mb"

DavidTurner · September 28, 2020, 6:36am

It looks like you are pushing your nodes far harder than the defaults allow, and far harder than they can cope:

The default for this is 2 and increasing it is not recommended.

The default for this is 4 and increasing it is not recommended.

The default for this is 40mb; increasing it is ok if your nodes can cope with the increase, but the evidence indicates that 250mb is far too high for your warm nodes.

DavidTurner · September 28, 2020, 6:38am

otoce:

cluster.fault_detection.follower_check.interval: 3s
cluster.fault_detection.follower_check.timeout: 30s
cluster.fault_detection.follower_check.retry_count: 9
cluster.fault_detection.leader_check.interval: 3s
cluster.fault_detection.leader_check.timeout: 30s
cluster.fault_detection.leader_check.retry_count: 9

These settings are documented as expert settings with the following warning:

WARNING: If you adjust these settings then your cluster may not form correctly or may become unstable or intolerant of certain failures.

You should not be adjusting them.

otoce · September 28, 2020, 7:07am

Hi David,

thanks for answer.

tried before with:

       "cluster.routing.allocation.node_concurrent_incoming_recoveries" : "1",
        "cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "1",
        "cluster.routing.allocation.node_concurrent_recoveries" : "2",
        "cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
        "cluster.routing.allocation.node_initial_primaries_recoveries" : "10",
        "indices.recovery.max_bytes_per_sec" : "250mb"

but same result. Ok, indices.recovery.max_bytes_per_sec is now 80MB (our network is 10G).

There is lot of unassigned shards, node_left (all nodes are online). I'll try first to manualy allocate shards to node (where they are) and then continue. Any other suggestion is welcome.

DavidTurner · September 28, 2020, 7:24am

The limiting factor is likely your disk bandwidth, not your network. Saturated disk bandwidth would explain the symptoms you're describing. Even 80mb (i.e. 640Mbps) may be too high for spinning disks.

(edit: 640Mbps not 640MBps )

DavidTurner · September 28, 2020, 7:27am

Why did you not reduce this to the default of 4 (or lower)? It may not make much difference, but your cluster is showing severe signs of overload, you definitely should not be relaxing any of these limits.

otoce · September 28, 2020, 7:45am

ok, now it's 20mb (indices.recovery.max_bytes_per_sec)

cluster was doing just fine with

cluster.routing.allocation.node_initial_primaries_recoveries" : "10"

even 20 or 30 (to yellow state), problems starts with enabling concurrent_recoveries.

ok, and cluster.routing.allocation.node_initial_primaries_recoveries is default 4

otoce · September 29, 2020, 8:32am

Hi,

update. Still not in green:

  "number_of_nodes" : 8,
  "number_of_data_nodes" : 8,
  "active_primary_shards" : 6118,
  "active_shards" : 6987,
  "relocating_shards" : 0,
  "initializing_shards" : 13,
  "unassigned_shards" : 6813,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 31,
  "number_of_in_flight_fetch" : 488,
  "task_max_waiting_in_queue_millis" : 12674,
  "active_shards_percent_as_number" : 50.58278433359879

Last warm node was down this morning (3h ago) so now setup is min. as possible

"cluster.routing.allocation.node_concurrent_incoming_recoveries" : "1",
"cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "1",
"cluster.routing.allocation.node_concurrent_recoveries" : "2",
"cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
"cluster.routing.allocation.node_initial_primaries_recoveries" : "4",
"indices.recovery.max_bytes_per_sec" : "20mb"

But here and there (every 15-20 min) we see massive reading 2GB/s (no write) on all hdd (around 3-4 min) that belongs to warm node. Is this internal job reading state on all disks for cluster state or, any workaround ?

[2020-09-29T08:18:31,629][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [13266ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6862] unchanged indices
[2020-09-29T08:35:59,043][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra3_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-29T09:04:18,835][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [12493ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6864] unchanged indices
[2020-09-29T09:21:23,240][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [23118ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T09:22:02,740][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [29629ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [3] indices and skipped [6863] unchanged indices
[2020-09-29T09:22:14,778][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra3_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-29T09:22:22,896][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [20066ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [4] indices and skipped [6862] unchanged indices
[2020-09-29T09:23:02,220][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [14211ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [2] indices and skipped [6864] unchanged indices
[2020-09-29T09:29:55,293][INFO ][o.e.m.j.JvmGcMonitorService] [serverra3_warm.sit.comp.state] [gc][115326] overhead, spent [687ms] collecting in the last [1.4s]
[2020-09-29T09:32:18,047][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [10102ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [3] indices and skipped [6863] unchanged indices
[2020-09-29T09:32:33,372][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [15213ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [2] indices and skipped [6864] unchanged indices
[2020-09-29T09:33:03,862][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [10608ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T09:33:21,169][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [10435ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T09:33:51,257][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [23078ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T09:34:05,110][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [13821ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [2] indices and skipped [6864] unchanged indices
[2020-09-29T09:35:34,168][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [11695ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T09:35:44,631][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [10402ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6865] unchanged indices
[2020-09-29T10:04:09,530][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra3_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-29T10:04:09,619][WARN ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] cluster state applier task [ApplyCommitRequest{term=68, version=440847, sourceNode={serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}] took [31.7s] which is above the warn tstateeshold of [30s]: [running task [ApplyCommitRequest{term=68, version=440847, sourceNode={serverra2.sit.comp.state}{wHODtVp4QJmjUVaO86eUPw}{BU0qh0kUR6O2YzrQ6Onxdg}{10.100.24.231}{10.100.24.231:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}]] took [0ms], [connecting to new nodes] took [0ms], [applying settings] took [0ms], [running applier [org.elasticsearch.indices.cluster.IndicesClusterStateService@3c206a30]] took [31717ms], [running applier [org.elasticsearch.script.ScriptService@7e6de2ab]] took [0ms], [running applier [org.elasticsearch.xpack.ilm.IndexLifecycleService@f297a52]] took [0ms], [running applier [org.elasticsearch.repositories.RepositoriesService@26d88ba5]] took [0ms], [running applier [org.elasticsearch.snapshots.RestoreService@72f9a78]] took [0ms], [running applier [org.elasticsearch.ingest.IngestService@ed275b8]] took [0ms], [running applier [org.elasticsearch.action.ingest.IngestActionForwarder@755a6441]] took [0ms], [running applier [org.elasticsearch.action.admin.cluster.repositories.cleanup.TransportCleanupRepositoryAction$$Lambda$3650/0x00007fb3b34fd0b0@6308e5cb]] took [0ms], [running applier [org.elasticsearch.tasks.TaskManager@203d9fc7]] took [0ms], [notifying listener [org.elasticsearch.cluster.InternalClusterInfoService@12659bb9]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@7d401c2e]] took [2ms], [notifying listener [org.elasticsearch.xpack.security.support.SecurityIndexManager@3622b917]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.authc.TokenService$$Lambda$2346/0x00007fb3f50ff858@68bd1334]] took [0ms], [notifying listener [org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$$Lambda$2429/0x00007fb3f51ca058@3dea896f]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.support.WatcherIndexTemplateRegistry@1e2375c4]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherLifeCycleService@1e87f187]] took [0ms], [notifying listener [org.elasticsearch.xpack.watcher.WatcherIndexingListener@565b6fe3]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlIndexTemplateRegistry@444b7943]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.job.process.autodetect.AutodetectProcessManager@7da7b9da]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.datafeed.DatafeedManager$TaskRunner@18deae60]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService@2922f412]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlAssignmentNotifier@2c3b427]] took [0ms], [notifying listener [org.elasticsearch.xpack.ml.MlInitializationService@301071f1]] took [0ms], [notifying listener [org.elasticsearch.xpack.ilm.history.ILMHistoryTemplateRegistry@5dd0743d]] took [0ms], [notifying listener [org.elasticsearch.xpack.ilm.IndexLifecycleService@f297a52]] took [0ms], [notifying listener [org.elasticsearch.xpack.core.slm.history.SnapshotLifecycleTemplateRegistry@63e....
[2020-09-29T10:04:20,925][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra3_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-29T10:04:54,817][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [77794ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [3] indices and skipped [6865] unchanged indices
[2020-09-29T10:06:54,687][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [119861ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [5] indices and skipped [6863] unchanged indices
[2020-09-29T10:07:32,146][ERROR][o.e.x.m.c.n.NodeStatsCollector] [serverra3_warm.sit.comp.state] collector [node_stats] timed out when collecting data
[2020-09-29T10:08:14,307][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [79728ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [3] indices and skipped [6865] unchanged indices
[2020-09-29T10:08:52,678][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [38247ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [5] indices and skipped [6863] unchanged indices
[2020-09-29T10:09:25,831][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [33281ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [7] indices and skipped [6861] unchanged indices
[2020-09-29T10:09:57,267][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [31290ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [6] indices and skipped [6862] unchanged indices
[2020-09-29T10:10:17,386][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [13196ms] which is above the warn tstateeshold of [10s]; wrote global metadata [false] and metadata for [4] indices and skipped [6864] unchanged indices

DavidTurner · September 29, 2020, 9:14am

Puzzling that there's such a high read rate, and no this isn't anything to do with cluster state updates although it certainly seems to be interfering with them. Can you GET _nodes/serverra3_warm.sit.comp.state/hot_threads?threads=99999 to see why Elasticsearch is reading from disk so heavily?

DavidTurner · September 29, 2020, 9:20am

Hang on, is Elasticsearch's disk on /dev/dm-2? If so, and if this read traffic was caused by Elasticsearch, then it'd be coming through /dev/dm-2, but that device reports zero read traffic.

otoce · September 29, 2020, 9:47am

dm-2 is ssd /dev/sda, hot node. Hdds for warm node are from sdb - sdi

i'm waiting for big read for GET hot threads

otoce · September 29, 2020, 10:34am

one warm node is out just now, server was doing nothing. Both nodes (master and warm) are on the same server

master log: https://pastebin.pl/view/d18a3ec1

warm node log: https://pastebin.pl/view/ebc36c2e

threads: https://pastebin.pl/view/e74dab9d

log is now in DEBUG mode

settings still:

"cluster.routing.allocation.node_concurrent_incoming_recoveries" : "1",
"cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "1",
"cluster.routing.allocation.node_concurrent_recoveries" : "2",
"cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
"cluster.routing.allocation.node_initial_primaries_recoveries" : "4",
"indices.recovery.max_bytes_per_sec" : "20mb"

otoce · September 29, 2020, 11:49am

output from warm node ra2 with big read (the same one from post above) https://pastebin.pl/view/2e1ac614

one more: https://pastebin.pl/view/f9e2ff26

DavidTurner · September 29, 2020, 12:03pm

Looks like https://github.com/elastic/elasticsearch/issues/56933 so my first suggestion would be to reduce the size of the fetch_shard_store threadpool to 1 on your warm nodes.

otoce · September 29, 2020, 1:13pm

this setting can be set dynamically without restart warm node ? if so how ? thanks

DavidTurner · September 29, 2020, 1:20pm

No, you must set it in elasticsearch.yml and restart the nodes. You can do this in a rolling fashion, however, one-node-at-a-time.

otoce · September 29, 2020, 2:34pm

many thanks for fast responses

question was because it takes 2-3 hour, after restart of warm node, to get to the point of yellow state. We decided to clear all replica only for warm nodes and recreate them after reaching green state.

Did you look at case when warm node fall out without disk activity (post 4h ago) ?

DavidTurner · September 29, 2020, 3:01pm

Such is life when your cluster has far too many shards.

That sounds like a bad idea to me: green health with no replicas is actually less healthy than yellow health with replicas, since at least in the latter case some of the replicas are assigned. Moreover if it is the issue I linked above then you'll just hit the same situation when re-enabling replicas.

Topic		Replies	Views
Shards Taking a Long Time to Move Between Nodes - Cloud [7.1.1] Elasticsearch	50	4124	July 29, 2019
Elasticsearch global state file Elasticsearch	18	3257	June 6, 2019
Slow startup (replica recovery in logs) Elasticsearch	11	1820	July 6, 2017
New User -- Index Settings Reccomdendations and Suggestions Elasticsearch	8	462	July 6, 2017
Shard Stuck in INITIALIZING and RELOCATING for more than 12 hours Elasticsearch	37	33299	January 25, 2019

Warm nodes on spinning disks not stable, falling out of cluster during recovery

Related topics