Hi guys,
We've added a few nodes to spread the disk load. The cluster stayed yellow, but 4 nodes still disconnected from the cluster during the index deletion:
insertOrder timeInQueue priority  source
7316        21.3s       IMMEDIATE node-left
7317        21.3s       IMMEDIATE node-left
7318        21.3s       IMMEDIATE node-left
7319        21.3s       IMMEDIATE node-left
7325        19.3s       URGENT    node-join
7321        21.3s       HIGH      shard-failed
7322        21.3s       HIGH      shard-failed
7323        20.8s       HIGH      shard-failed
7324        20.8s       HIGH      shard-failed
7320        21.3s       HIGH      shard-failed
[2023-05-25T00:47:35,421][WARN ][o.e.c.c.LagDetector ] [esm04] node [{esd02}{nIoZq1ZWRiKgPBz3x6uJAg}{BbJGSC4zRv2ID0hEfXghGw}{x.x.x.x:9300}{cdfhstw}{xpack.installed=true, transform.node=true}] is lagging at cluster state version [13093], although publication of cluster state version [13094] completed [1.5m] ago
[2023-05-25T00:47:35,422][WARN ][o.e.c.c.LagDetector ] [esm04] node [{esd03}{mDYiwqFkS-Sj7A9YcyLmrA}{L2ZdyuXhTr6Mh9vbk8Acjg}{x.x.x.x:9300}{cdfhstw}{xpack.installed=true, transform.node=true}] is lagging at cluster state version [13093], although publication of cluster state version [13094] completed [1.5m] ago
[2023-05-25T00:47:35,422][WARN ][o.e.c.c.LagDetector ] [esm04] node [{esd08}{T83ju1TKQhyZUd2LI4Atlw}{R7tciuZQQQadTA3TFIgWCA}{x.x.x.x:9300}{cdfhstw}{xpack.installed=true, transform.node=true}] is lagging at cluster state version [13093], although publication of cluster state version [13094] completed [1.5m] ago
[2023-05-25T00:47:35,423][WARN ][o.e.c.c.LagDetector ] [esm04] node [{esd06}{ReFWrVXVSf-a1ould6uIEg}{TpJGLw5XQQe3Cen3lIVxIQ}{x.x.x.x:9300}{cdfhstw}{xpack.installed=true, transform.node=true}] is lagging at cluster state version [13093], although publication of cluster state version [13094] completed [1.5m] ago
The nodes rejoined immediately, but in the meantime we had a bunch of UNASSIGNED and INITIALIZING shards and a YELLOW cluster state, which could turn RED if the removed nodes took out enough shards to cause an outage.
Is it safe to bump index.unassigned.node_left.delayed_timeout to ~5 minutes to prevent the master from kicking them out during the deletion operation? I realize that getting faster drives or more instances could speed up the process, but we might not have that option.
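For reference, this is roughly the change we have in mind. It is only a minimal sketch: the base URL `http://localhost:9200`, the `_all` index target, and the helper name `build_delayed_timeout_request` are assumptions for illustration, and the script only builds the request without sending it. As far as I understand the docs, this setting delays reallocation of shards left unassigned after a node leaves, so whether it helps with the node removals themselves is exactly what I'm unsure about.

```python
import json
import urllib.request

# Full name of the index-level setting discussed above.
SETTING = "index.unassigned.node_left.delayed_timeout"


def build_delayed_timeout_request(base_url, timeout="5m", index="_all"):
    """Build (but do not send) a PUT <index>/_settings request.

    base_url and index are hypothetical defaults; adjust them for your cluster.
    """
    body = {"settings": {SETTING: timeout}}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/{index}/_settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    req = build_delayed_timeout_request("http://localhost:9200")
    # Inspect the request before applying it.
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually apply it (assumes no auth/TLS): urllib.request.urlopen(req)
```

The setting is dynamic, so it could be reverted the same way after the deletion finishes by sending the previous value.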