The cluster creates a new index with all 5 of its shards on the same node, and it picks the node with the least free disk space. Our high-throughput ETL process then writes all of its data to that single node instead of spreading it across 5 nodes. This pushes that node's CPU to 100% and the whole cluster becomes high-latency or effectively unavailable (writing a single document takes several seconds).
The problem seems to be related to cluster shard allocation: whenever ILM creates a new index via rollover, it places all of its shards on this one node.
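This is roughly how we check where the shards of a freshly rolled-over index end up and why they were allocated there (the index name is just a placeholder for one of our ILM-managed indices):

GET _cat/shards/my-index-000042?v&h=index,shard,prirep,state,node

GET _cluster/allocation/explain
{
  "index": "my-index-000042",
  "shard": 0,
  "primary": true
}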
As a workaround I temporarily excluded the node from allocation and rolled the index over manually, and that helped: the new index was created with its shards spread across several nodes and everything was fine. However, after some time, when ILM rolled over to new indices again, the same thing happened.
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._name": "elastic12"
  }
}
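and then triggered the rollover manually (the write alias name below is a placeholder for our actual alias):

POST my-write-alias/_rollover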
Our cluster has 15 data nodes, 3 warm nodes, 3 master nodes and 2 coordinator nodes. Each data node has 32 GB RAM, a 500 GB SSD and 8 CPU cores. Elasticsearch version: 7.13.3.
Do you have any idea why this happens and how to prevent it from happening again?
Current shard count and disk usage per data node:

shards disk.indices disk.used disk.avail disk.total disk.percent node
111 406.8gb 440gb 51gb 491gb 89 elastic12   <-- the problematic node: note the low shard count and low free disk space
228 400.3gb 433.2gb 57.7gb 491gb 88 elastic04
230 394.1gb 432.7gb 58.3gb 491gb 88 elastic05
237 283.6gb 314.6gb 176.4gb 491gb 64 elastic13
237 271gb 301.8gb 189.1gb 491gb 61 elastic07
237 321gb 350.8gb 140.1gb 491gb 71 elastic14
237 245.6gb 276.6gb 214.4gb 491gb 56 elastic09
237 195.5gb 226.4gb 264.6gb 491gb 46 elastic02
237 277.4gb 308gb 183gb 491gb 62 elastic06
237 303.1gb 333.4gb 157.6gb 491gb 67 elastic15
237 236.1gb 266.8gb 224.2gb 491gb 54 elastic08
237 287.3gb 318.8gb 172.2gb 491gb 64 elastic03
237 251.9gb 283gb 208gb 491gb 57 elastic01
238 343.5gb 375.3gb 115.6gb 491gb 76 elastic11
238 322.8gb 354.9gb 136gb 491gb 72 elastic10
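Would limiting how many shards of a single index can be placed on one node help here? Something along these lines in the index template (just a sketch; the template name, index pattern and the limit of 1 are placeholders, not our actual config):

PUT _index_template/my-ilm-template
{
  "index_patterns": ["my-index-*"],
  "template": {
    "settings": {
      "index.routing.allocation.total_shards_per_node": 1
    }
  }
}

Or is there a better way to make the allocator spread the shards of a newly rolled-over index across the data nodes?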