Currently working with a 4-node ELK cluster (7.6) with 10k indices, 20k primary shards and 12k replica shards.
To stabilise cluster performance, I have tried freezing and closing indices, so the open, non-frozen indices now amount to 160 primaries + 160 replicas.
A rolling restart is currently taking 5+ hours per node, and searches are very slow overall.
Is it expected that indices still have an impact on cluster performance even when they are closed and frozen?
You have far too many shards, which leads to a large cluster state and lots of updates that need to be propagated. The default limit of 1000 shards per node is there for a reason, so you should look to get below that. Searching a lot of small shards can be a lot slower than querying the same amount of data distributed across fewer, larger shards.
In recent versions the cluster keeps track of frozen and closed shards so closing or freezing does not reduce the size of the cluster state as much as it used to.
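To put the numbers from the question against the per-node limit mentioned above, here is a rough back-of-the-envelope sketch in Python. It assumes every shard in the thread is counted, whether open, frozen or closed, and the function names are only illustrative. The 1000 figure is the default quoted above (in 7.x this is the `cluster.max_shards_per_node` setting).

```python
import math

def shards_per_node(primary_shards, replica_shards, nodes):
    """Average number of shards each node would hold if spread evenly."""
    return (primary_shards + replica_shards) / nodes

def nodes_needed(total_shards, limit_per_node=1000):
    """Smallest node count that keeps the average at or below the limit."""
    return math.ceil(total_shards / limit_per_node)

# Figures taken from the original post: 20k primaries, 12k replicas, 4 nodes.
total = 20_000 + 12_000
print(shards_per_node(20_000, 12_000, 4))  # 8000.0
print(nodes_needed(total))                 # 32
```

So on these figures the cluster is roughly 8x over the default limit, which is why reducing the shard count is likely to help more than adding a node or two.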
This is why I was confused. Before upgrading from 6.x to 7, closing indices improved performance: I only kept a month of data open, and rolling restarts took minutes. Is this a 7.x change?
We started with daily indices with 8 shards per index (1P + 1R per node), at 10s to 100s of MB per day. Obviously I now know that this is not healthy for the cluster, and more recently we have been running 1 primary shard + 1 replica at 10s of GB per index, with 32 GB of RAM per node. CPU is not under pressure.
I feel I just need to go back and reindex the bad choices of the past into larger indices, and potentially increase the number of nodes.
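One possible way to plan that reindex is to group the old daily indices into monthly targets and build one `_reindex` request body per month. The sketch below is only an illustration: the `logs-YYYY.MM.dd` naming pattern and the function names are assumptions, not something stated in this thread, and it only builds the request bodies without sending anything.

```python
from collections import defaultdict

def monthly_targets(daily_indices):
    """Group daily index names like 'logs-2019.01.02' under 'logs-2019.01'."""
    groups = defaultdict(list)
    for name in daily_indices:
        # Assumed naming pattern: <prefix>-<year>.<month>.<day>
        prefix, _, date_part = name.rpartition("-")
        year, month, _day = date_part.split(".")
        groups[f"{prefix}-{year}.{month}"].append(name)
    return dict(groups)

def reindex_body(sources, dest):
    """Body for POST _reindex that copies several source indices into one."""
    return {"source": {"index": sorted(sources)}, "dest": {"index": dest}}

daily = ["logs-2019.01.01", "logs-2019.01.02", "logs-2019.02.01"]
for dest, sources in monthly_targets(daily).items():
    print(reindex_body(sources, dest))
```

With 10s to 100s of MB per day, a month of data in a single primary shard would stay well under the 50GB guideline mentioned elsewhere in this thread. The old indices would need to be open while they are being reindexed.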
Also why so many shards - your data is not that large and why do you need to have a P&R shard on every node? Why is 1P/1R not enough as long as shards stay under 50GB or so?
Separately, I'd think closed indices would not affect updates or performance significantly. Also, for rolling restarts, make sure you set the delayed allocation timeout on node loss (index.unassigned.node_left.delayed_timeout) high enough that shards are not moved during the restart. The default is 1m, and you can see whether things move in _cluster/health. We use 15-60m.
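For reference, the delayed allocation timeout above is an index-level setting, so it is changed through the index settings API. A minimal Python sketch of the request follows. The helper name, the `_all` target and the 15m value are illustrative assumptions, and the function only builds the request rather than sending it.

```python
import json

def delayed_timeout_request(timeout="15m", index_pattern="_all"):
    """Return (method, path, body) for raising the node-left delay."""
    path = f"/{index_pattern}/_settings"
    body = {
        "settings": {
            # Default is 1m; the post above suggests 15-60m for restarts.
            "index.unassigned.node_left.delayed_timeout": timeout
        }
    }
    return "PUT", path, json.dumps(body)

method, path, body = delayed_timeout_request()
print(method, path)
print(body)
```

The resulting request can be sent with curl or pasted into Kibana Dev Tools. While the node is restarting, the health output should show delayed unassigned shards rather than relocations if the timeout is taking effect.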
It's only historic data that is like that, so I just need to go back and reindex it.
The indices are frozen and closed, but the overall system performance (i.e. taking several hours to re-add a node) feels as though they are still being treated as open/frozen.