Does creating a snapshot slow down the cluster?

I use Elasticsearch 5.6.

While a snapshot was running, I requested http://localhost:9200/_cluster/health and got no response for more than 10 seconds. I can also see that the machines have heavy disk and network IO while the snapshot runs.

The delay does not happen when no snapshot is running.

I check _cluster/health with a timeout to make sure that creating a snapshot does not slow down queries. Is this the correct way to check? In practice, will creating snapshots slow queries down?
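For reference, the kind of timed check described above could be sketched in Python roughly as follows. This is only an illustration: `timed_get` and its `opener` parameter are made-up names, and the 10-second default simply mirrors the delay mentioned in this post.

```python
import time
import urllib.request


def timed_get(url, timeout_s=10.0, opener=urllib.request.urlopen):
    """Return (elapsed_seconds, ok) for one GET; ok is False on timeout or error."""
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as resp:
            resp.read()  # drain the body so the full response time is measured
        ok = True
    except OSError:
        # urllib.error.URLError and socket timeouts are both OSError subclasses
        ok = False
    return time.monotonic() - start, ok
```

Calling `timed_get("http://localhost:9200/_cluster/health")` once before and once during a snapshot gives two numbers to compare. Pointing the same helper at a real search URL would time an actual query rather than the health endpoint.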

I imagine creating snapshots could slow down your queries. Snapshots are going to read lots of blocks off disk, and those blocks will go into the filesystem cache. Queries that might have been served from FS cache may need to go to disk instead, and then they will be competing with the snapshot process for access to disk IO. The effect would be magnified for larger documents, large result sets, and scan/scrolls.

@loren, is there any good practice to follow about when to run a snapshot?

I run snapshots on a live Elasticsearch cluster that serves users. Ideally we do not want to affect users' search experience.

One option is to do this when traffic is low. But traffic can still change suddenly...

I don't know of anything other than spreading the data onto more nodes, or increasing the RAM of each node. Both of those will improve the ratio of RAM to disk, which would lessen the impact of FS cache purges and disk contention. Better compression could help in the same way.

But first I'd want to benchmark some more to be sure that snapshot is indeed interfering with query performance. And then I'd try to make the queries less reliant on disk reads, perhaps by storing fewer fields or breaking the data up into multiple indices.
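A rough sketch of what that benchmark could look like in Python: time the same representative query many times with and without a snapshot running, then compare percentiles rather than single readings. The helper names (`sample_latencies`, `percentile`, `summarize`) are invented for this example, and `run_query` stands for whatever callable issues one of your real queries.

```python
import math
import time


def sample_latencies(run_query, n=50, pause_s=0.0):
    """Call run_query() n times and return each call's duration in seconds."""
    durations = []
    for _ in range(n):
        start = time.monotonic()
        run_query()
        durations.append(time.monotonic() - start)
        time.sleep(pause_s)
    return durations


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, for pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(label, durations):
    """Condense one run into the figures worth comparing across runs."""
    return {
        "label": label,
        "p50": percentile(durations, 50),
        "p99": percentile(durations, 99),
        "max": max(durations),
    }
```

Collect one summary while idle and one during a snapshot. If p50 barely moves but p99 and max jump, that would be consistent with only the queries that miss the FS cache being hurt.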

Good luck!

Could old-generation GC be the cause of the slowdown? I made another post: "Does GC at snapshot affect performance? can users force GC?"

I doubt it, unless your JVM is under so much memory pressure that it has to stop the world. Doesn't seem to be the case for you.

I'd suggest running something like iostat before/during/after your snapshot. If you see IO Wait times spike during the snapshot, then any queries not served 100% from memory are going to take longer. If you regularly see IO Wait when snapshot isn't running, then you already have an overburdened storage system and you have bigger problems than snapshot.

Make sure you run iostat on the volumes that contain your shards. For example, I might run iostat -mx nvme0n1 11 to monitor how a mounted NVMe drive on my EC2 i3 instance is performing.
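If you'd rather log the before/during/after comparison from a script, here is a minimal Python sketch that derives a machine-wide IO wait percentage from two readings of /proc/stat. It is Linux-only, the function names are made up for this example, and it reports CPU iowait for the whole host rather than per-device figures, so iostat -x on the shard volumes is still what tells you which disk is busy.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, as a percentage."""
    # Field order: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already included in user/nice, so only the first 8 are summed.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


def read_cpu_sample(path="/proc/stat"):
    """Read the current aggregate CPU counters (the first line of /proc/stat)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

Take one sample with `read_cpu_sample()`, wait a fixed interval, take another, and pass both to `iowait_percent`. Repeating that before, during, and after the snapshot gives three numbers you can line up against query latency.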
