Hi,
Our warm nodes keep dropping out of the cluster during recovery, and then recovery starts again from zero. Hot nodes are OK.
Env:
4 servers (256GB RAM, 8 CPUs each), each running 1 hot node (64GB, SSD, 30 days of data, heavy input from Logstash at 5-10k/s) and 1 warm node (32GB heap, spinning disks, data older than 30 days). 6k indices, 20k shards.
Warm nodes drop out intermittently. I can bring the cluster to yellow state with:
"cluster.routing.allocation.node_concurrent_incoming_recoveries" : "0",
"cluster.routing.allocation.node_concurrent_outgoing_recoveries" : "0",
"cluster.routing.allocation.node_concurrent_recoveries" : "0",
"cluster.routing.allocation.cluster_concurrent_rebalance" : "0",
"cluster.routing.allocation.node_initial_primaries_recoveries" : "40"
but after raising concurrent_recoveries to 8, or even 4, the warm nodes begin to fall out of the cluster every 2-3h, with heavy load on the spinning disks.
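For anyone who wants to reproduce the throttling, here is a minimal sketch of the request body for `PUT _cluster/settings`, built from the values listed above. The helper name `build_recovery_settings` is hypothetical, and using the `transient` scope is an assumption (`persistent` would be set the same way).

```python
import json

PREFIX = "cluster.routing.allocation."


def build_recovery_settings(concurrent_recoveries, initial_primaries=40):
    """Build a body for PUT _cluster/settings (hypothetical helper).

    Assumption: the settings are applied as "transient"; swap the key for
    "persistent" if they should survive a full cluster restart.
    """
    n = str(concurrent_recoveries)
    return {
        "transient": {
            PREFIX + "node_concurrent_incoming_recoveries": n,
            PREFIX + "node_concurrent_outgoing_recoveries": n,
            PREFIX + "node_concurrent_recoveries": n,
            # Rebalancing stays disabled, as in the settings listed above.
            PREFIX + "cluster_concurrent_rebalance": "0",
            PREFIX + "node_initial_primaries_recoveries": str(initial_primaries),
        }
    }


# Body matching the values above (all concurrent recoveries set to 0).
throttled = build_recovery_settings(0)
print(json.dumps(throttled, indent=2))
```

Calling `build_recovery_settings(4)` gives the body for the less restrictive setup that triggers the node drops.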
extra settings:
cluster.publish.info_timeout: 40s
cluster.publish.timeout: 120s
cluster.follower_lag.timeout: 300s
discovery.zen.fd.ping_timeout : 111s
discovery.zen.fd.ping_retries : 5
status:
"cluster_name" : "server-cluster",
"status" : "yellow",
"timed_out" : false,
"number_of_nodes" : 8,
"number_of_data_nodes" : 8,
"active_primary_shards" : 6775,
"active_shards" : 9001,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 11281,
"delayed_unassigned_shards" : 158,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 44.37925253919732
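As a sanity check on the status above, the reported percentage can be recomputed from the shard counts. This is a small sketch that assumes the percentage is active shards divided by the sum of active, initializing, and unassigned shards; the function name `active_percent` is made up for illustration.

```python
# Shard counts copied from the cluster health output above.
health = {
    "active_shards": 9001,
    "initializing_shards": 0,
    "unassigned_shards": 11281,
}


def active_percent(h):
    """Percentage of shards that are active (assumed formula, see lead-in)."""
    total = h["active_shards"] + h["initializing_shards"] + h["unassigned_shards"]
    return 100.0 * h["active_shards"] / total


total_shards = sum(health.values())
print(f"total shards: {total_shards}")
print(f"active: {active_percent(health):.4f}%")
```

The total of 20282 shards agrees with the roughly 20k shards mentioned in the environment description.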
Please advise on other timeout or ... settings, thanks.
log from master node:
[2020-09-27T11:01:14,432][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411159, reason: ApplyCommitRequest{term=64, version=411159, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:11,207][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411161, reason: ApplyCommitRequest{term=64, version=411161, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:15,982][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411162, reason: ApplyCommitRequest{term=64, version=411162, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:23,024][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411165, reason: ApplyCommitRequest{term=64, version=411165, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:26,112][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411166, reason: ApplyCommitRequest{term=64, version=411166, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:29,865][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411168, reason: ApplyCommitRequest{term=64, version=411168, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:32,071][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411169, reason: ApplyCommitRequest{term=64, version=411169, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:40,727][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411171, reason: ApplyCommitRequest{term=64, version=411171, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:41,762][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411172, reason: ApplyCommitRequest{term=64, version=411172, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:44,702][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411174, reason: ApplyCommitRequest{term=64, version=411174, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:48,272][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411175, reason: ApplyCommitRequest{term=64, version=411175, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:51,359][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411176, reason: ApplyCommitRequest{term=64, version=411176, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:03:54,992][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411177, reason: ApplyCommitRequest{term=64, version=411177, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:04:00,065][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}{hB25a3FYRIiwrWMNgHks9A}{10.100.24.230}{10.100.24.230:9301}{dlrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=warm, transform.node=true}}, term: 64, version: 411179, reason: ApplyCommitRequest{term=64, version=411179, sourceNode={serverra3.sit.comp.state}{BVnFEkNNTcKHn-WldV8mlw}{wh_nNgblT7OpMU4_BD59wA}{10.100.24.232}{10.100.24.232:9300}{dilmrt}{rack_id=rack_one, ml.machine_memory=269645852672, ml.max_open_jobs=20, xpack.installed=true, data=hot, transform.node=true}}
[2020-09-27T11:04:07,309][WARN ][o.e.g.PersistedClusterStateService] [serverra3_warm.sit.comp.state] writing cluster state took [11794ms] which is above the warn threshold of [10s]; wrote global metadata [false] and metadata for [1] indices and skipped [6770] unchanged indices
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings ] [serverra3_warm.sit.comp.state] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [10] to [20]
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings ] [serverra3_warm.sit.comp.state] updating [cluster.routing.allocation.node_concurrent_incoming_recoveries] from [2] to [4]
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings ] [serverra3_warm.sit.comp.state] updating [cluster.routing.allocation.node_concurrent_outgoing_recoveries] from [2] to [4]
[2020-09-27T11:16:14,032][INFO ][o.e.c.s.ClusterSettings ] [serverra3_warm.sit.comp.state] updating [indices.recovery.max_bytes_per_sec] from [200mb] to [500mb]
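To see how long the node stays out of the cluster on each flap, the removed/added pairs in the log above can be matched up. This is a rough sketch: the regular expression is written against the line format shown here only, the sample lines are the first four events from the log truncated after the node ID, and `downtime_intervals` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Matches the ClusterApplierService "removed"/"added" lines shown above.
EVENT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[INFO\s*\]\[o\.e\.c\.s\.ClusterApplierService\]\s*"
    r"\[[^\]]+\] (?P<action>removed|added) \{\{(?P<node>[^}]+)\}"
)

# First four events from the log above, truncated after the node ID.
SAMPLE_LINES = [
    "[2020-09-27T11:01:14,432][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}",
    "[2020-09-27T11:03:11,207][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}",
    "[2020-09-27T11:03:15,982][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] removed {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}",
    "[2020-09-27T11:03:23,024][INFO ][o.e.c.s.ClusterApplierService] [serverra3_warm.sit.comp.state] added {{serverra1_warm.sit.comp.state}{KyQI0BMySRKE4yoeitADCQ}",
]


def downtime_intervals(lines):
    """Return (node, seconds_absent) for each removed -> added pair."""
    removed_at = {}
    intervals = []
    for line in lines:
        m = EVENT_RE.match(line)
        if not m:
            continue  # skip WARN lines, settings updates, etc.
        ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S,%f")
        node = m["node"]
        if m["action"] == "removed":
            removed_at[node] = ts
        elif node in removed_at:
            gone = ts - removed_at.pop(node)
            intervals.append((node, gone.total_seconds()))
    return intervals


for node, seconds in downtime_intervals(SAMPLE_LINES):
    print(f"{node} was out of the cluster for {seconds:.3f}s")
```

Running the same function over the full master log would show whether the node is absent for seconds or for minutes at a time.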
log from warm node https://pastebin.pl/view/5f4990c8