I have been observing very slow start-up times when restoring snapshots in Elasticsearch. I am using Grafana to monitor the cluster. As you can see, the snapshot restore was started around 17:15, and significant network usage only appeared about 3 hours later.
Is this normal? How does the restore process work, and why does it not start immediately?
It is worth mentioning that this snapshot did not finish successfully (reasons unknown), but similar behaviour has been observed with previous snapshots.
I am restoring one index with 13.2G documents and 470 shards, about 25T of data, restoring 1 replica and adding 1 replica in the restore process. I would like to speed this up: there is a huge 3-hour lag before the transfer actually starts, and it only goes up to 20Mb/s, which is slow considering all these services are hosted in GCP, so we should be able to get higher internal bandwidth. In an attempt to speed up the process I have made the following changes:
{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "250mb",
    "indices.recovery.max_concurrent_file_chunks": 5,
    "indices.recovery.concurrent_streams": 10,
    "cluster.routing.allocation.node_concurrent_recoveries": 10,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 20
  }
}
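For reference, this is roughly how such a body could be sent to the cluster settings endpoint (`PUT /_cluster/settings`). It is only a minimal sketch: the host `http://localhost:9200` and the helper name `build_settings_request` are assumptions for illustration, the request is built but not sent, and whether every key is accepted depends on the Elasticsearch version in use.

```python
import json
import urllib.request

# Recovery-related settings, copied unchanged from the body above.
RECOVERY_SETTINGS = {
    "indices.recovery.max_bytes_per_sec": "250mb",
    "indices.recovery.max_concurrent_file_chunks": 5,
    "indices.recovery.concurrent_streams": 10,
    "cluster.routing.allocation.node_concurrent_recoveries": 10,
    "cluster.routing.allocation.node_initial_primaries_recoveries": 20,
}


def build_settings_request(host="http://localhost:9200"):
    """Build (but do not send) a PUT /_cluster/settings request.

    `host` is a placeholder; replace it with the real cluster address.
    """
    body = json.dumps({"persistent": RECOVERY_SETTINGS}).encode("utf-8")
    return urllib.request.Request(
        url=host + "/_cluster/settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    request = build_settings_request()
    print(request.get_method(), request.full_url)
    # To actually apply the settings, uncomment the lines below:
    # with urllib.request.urlopen(request) as response:
    #     print(response.read().decode("utf-8"))
```

Keeping the settings in one dictionary makes it easy to change a single value and resend the whole body while experimenting.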
Any recommendations or rules of thumb for these settings would be appreciated.
Thank you in advance.
EDIT:
There don't seem to be any pending tasks when I query _cluster/pending_tasks:
{
  "tasks": []
}
and the cluster health is:
{
  "cluster_name": "elasticsearch",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 7,
  "number_of_data_nodes": 7,
  "active_primary_shards": 5,
  "active_shards": 10,
  "relocating_shards": 0,
  "initializing_shards": 140,
  "unassigned_shards": 1270,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 0.7042253521126761
}
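As a sanity check on these numbers, the arithmetic can be sketched in a few lines of Python. All figures are taken from the health output and the settings in this question. The assumption that the 10 active shards belong to other indices is mine. The match between the 140 initializing shards and 7 nodes × 20 initial primary recoveries is only an observation, not a confirmed explanation of the slow start.

```python
# Figures copied from the cluster health output.
active_shards = 10
initializing_shards = 140
unassigned_shards = 1270
data_nodes = 7

# Figures from the question: 470 primary shards in the restored index and
# cluster.routing.allocation.node_initial_primaries_recoveries set to 20.
restored_primaries = 470
initial_primaries_recoveries = 20

# Reproduce active_shards_percent_as_number from the raw shard counts.
total_shards = active_shards + initializing_shards + unassigned_shards
active_percent = 100 * active_shards / total_shards

# Assumption: the 10 active shards belong to other indices, so everything
# initializing or unassigned is part of the restore.
restore_shards = initializing_shards + unassigned_shards
copies_per_primary = restore_shards / restored_primaries

# Upper bound on concurrent primary recoveries across the whole cluster.
cluster_primary_recovery_cap = data_nodes * initial_primaries_recoveries

print(f"total shards: {total_shards}")                    # 1420
print(f"active percent: {active_percent:.4f}")            # 0.7042
print(f"copies per primary: {copies_per_primary}")        # 3.0
print(f"recovery cap: {cluster_primary_recovery_cap}")    # 140
```

Under these assumptions, the restore accounts for 1410 shard copies (470 primaries, each with 2 replicas), and the number of initializing shards equals the cluster-wide cap on concurrent primary recoveries.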
The network bandwidth continues at only a few Kb/s:
Operation/s:
CPU usage:
Document Count and Indexing Rate:
It seems to be writing docs, but at a really slow pace. I suspect this will speed up soon, but this stage is very slow.
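To put the throughput in perspective, here is a back-of-the-envelope estimate of how long moving the data would take at different sustained rates. It is only a sketch under stated assumptions: "25T" is read as 25 × 10^12 bytes, "20Mb/s" is read as 20 megabytes per second for the whole cluster (the unit in the graph is ambiguous), and the helper name `transfer_hours` is made up for this example. It ignores replica copies and the initial lag.

```python
def transfer_hours(total_bytes: float, bytes_per_second: float) -> float:
    """Hours needed to move total_bytes at a constant rate."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return total_bytes / bytes_per_second / 3600


TB = 10**12  # decimal terabyte (assumption about what "25T" means)
MB = 10**6   # decimal megabyte (assumption about what "20Mb/s" means)

data_size = 25 * TB

rates = [
    ("observed peak, ~20 MB/s", 20 * MB),
    ("configured throttle, 250 MB/s", 250 * MB),
]

for label, rate in rates:
    hours = transfer_hours(data_size, rate)
    print(f"{label}: {hours:.1f} h ({hours / 24:.1f} days)")
```

At the observed peak this works out to about two weeks for the primary data alone, which is why the rate matters as much as the 3-hour delay.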