Snapshot performance with ES 7.10

Hi all:
We have upgraded from ES 6.8 to 7.10. Since the upgrade, snapshots on 7.10 take around 3 times as long as they did on 6.8. These are snapshots of the ES cluster to an NFS-mounted share. The configuration did not change between 6.8 and 7.10. Has anyone else seen this behavior? If so, are there any settings that need to be configured in ES 7.10 to get back the previous performance?

Thanks.

Is it still slow if you move to a fresh repository?

Hi Dave,
The slowness also happens on a fresh repository with a new (greenfield) installation of ES 7.10.

Thanks,
-Karun

Ok, thanks for checking that.

I'm not aware of any performance regression in this area. Can you quantify it in more absolute terms? How large/complex a snapshot are you taking (data volume, number of shards, any other pertinent info) and how long does it take?

Here are the details you asked for:

=6.8 ES=

curl -XGET 'xxxxx:9200/_cat/snapshots/SnapshotRepo_1?v&s=id'
id                                   status  start_epoch start_time end_epoch  end_time duration indices successful_shards failed_shards total_shards
eca95dae-1774-40be-921e-9311e5206263 SUCCESS 1609997090  05:24:50   1609997354 05:29:14 4.3m     3       12                0             12

=7.10 ES=
curl -XGET 'xxxx:9200/_cat/snapshots/SnapshotRepo_2?v&s=id'
id                                   status  start_epoch start_time end_epoch  end_time duration indices successful_shards failed_shards total_shards
eddfcb8c-02ab-4dc6-94e9-ab654a60ba51 SUCCESS 1610002940  07:02:20   1610003581 07:13:01 10.6m    3       12                0             12

How large were these indices?

Also note that snapshots will re-use previously-snapshotted data where possible, so if the repository wasn't empty then that will confound your measurements too.

The repository was empty, as we are trying to compare the performance of 6.8 vs 7.10.
Also, there are 3 indices and only one of them is big, around 6.2 GB. The others are a few MB or smaller.

Thanks,
Karun

Thanks. Yes, 10 minutes does seem longer than expected to make a ~6GB snapshot. The only relevant setting I can think of is max_snapshot_bytes_per_sec which defaults to 40mb, i.e. 40MB/s, but of course that applies to both versions. There have been quite a lot of changes to how snapshots work between 6.8 and 7.10 but (as I said) I'm not aware of any that would cause such a performance drop.
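
To put the numbers reported so far in absolute terms, here is a rough back-of-the-envelope calculation in Python. It is only a sketch: it assumes the snapshot size is dominated by the single ~6.2 GB index (the few-MB indices are ignored), treats 1 GB as 1000 MB, and takes the durations from the start_epoch and end_epoch columns posted earlier. The function name is made up for illustration.

```python
# Rough effective snapshot throughput, using figures quoted in this thread.
# Assumption: ~6.2 GB of data (the small indices are ignored), 1 GB = 1000 MB.

def throughput_mb_per_s(size_gb: float, start_epoch: int, end_epoch: int) -> float:
    """Average MB/s over the whole snapshot, from _cat/snapshots epoch columns."""
    duration_s = end_epoch - start_epoch
    return size_gb * 1000 / duration_s

SIZE_GB = 6.2
es68 = throughput_mb_per_s(SIZE_GB, 1609997090, 1609997354)   # 264 s elapsed
es710 = throughput_mb_per_s(SIZE_GB, 1610002940, 1610003581)  # 641 s elapsed

print(f"6.8:  {es68:.1f} MB/s")
print(f"7.10: {es710:.1f} MB/s")
print(f"slowdown: {es68 / es710:.2f}x")
```

On these assumptions both averages come out below the 40 MB/s default of max_snapshot_bytes_per_sec, which suggests the throttle is probably not the limiting factor in either version.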

Unless anyone else has better ideas I think you'll need to share some logs from every node, with these settings:

logger.org.elasticsearch.repositories: TRACE
logger.org.elasticsearch.snapshots: TRACE
logger.org.elasticsearch.cluster.service.MasterService: DEBUG

That will show a lot more detail on when things are happening and how long everything's taking.
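
The three logger lines above are in the form used in elasticsearch.yml. Logger levels can also be changed at runtime through the cluster settings API, which avoids a restart. A minimal Python sketch of that request follows; the base URL in the comment is an assumption (a local, unauthenticated node), and the helper names are invented for illustration.

```python
import json
import urllib.request

# Logger names and levels exactly as suggested in the thread.
LOG_LEVELS = {
    "logger.org.elasticsearch.repositories": "TRACE",
    "logger.org.elasticsearch.snapshots": "TRACE",
    "logger.org.elasticsearch.cluster.service.MasterService": "DEBUG",
}

def build_settings_body(levels: dict) -> bytes:
    """JSON body for PUT /_cluster/settings, using transient settings."""
    return json.dumps({"transient": levels}).encode("utf-8")

def apply_log_levels(base_url: str, levels: dict) -> int:
    """Send the settings to the cluster and return the HTTP status code.

    base_url is an assumption, e.g. "http://localhost:9200" with no auth.
    """
    request = urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=build_settings_body(levels),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# Print the request body; nothing is sent unless apply_log_levels is called.
print(build_settings_body(LOG_LEVELS).decode("utf-8"))
# To apply: apply_log_levels("http://localhost:9200", LOG_LEVELS)
```

Setting each logger back to null in the same way should restore the default level once the investigation is done.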

Hi Dave, sorry for the late response. We will try with the settings you suggested and provide you the logs.

Hi Dave, we have the logs. Is there a way to upload them somewhere, or can you let me know what to look for?

Sure, I sent you a private message with an upload link.

I don't think you correctly applied the logger config I mentioned above. The logs only contain this:

[2021-02-03T14:53:54,566][INFO ][o.e.r.RepositoriesService] [REDACTED] put repository [REDACTED]
[2021-02-03T14:53:54,803][INFO ][o.e.s.SnapshotsService   ] [REDACTED] snapshot [REDACTED] started
[2021-02-03T14:57:42,611][INFO ][o.e.s.SnapshotsService   ] [REDACTED] snapshot [REDACTED] completed with state [SUCCESS]

There should be a good deal of additional tracing statements between the "started" and "completed" lines.

Also, what exactly is the repository you're using? Is it an S3 bucket or is it something supposedly "S3-compatible"?

Also also, the timestamps indicate that this snapshot took a little under 4 minutes to complete, much less than the 10 minutes you originally reported. Does this mean the problem is fixed?
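
For reference, the elapsed time can be read straight off the two SnapshotsService log lines quoted above. A small Python sketch, assuming the leading bracketed field always has the yyyy-mm-ddTHH:MM:SS,mmm form shown there (the helper name is illustrative):

```python
from datetime import datetime

# The two SnapshotsService lines from the shared log, as quoted in the thread.
STARTED = "[2021-02-03T14:53:54,803][INFO ][o.e.s.SnapshotsService   ] [REDACTED] snapshot [REDACTED] started"
COMPLETED = "[2021-02-03T14:57:42,611][INFO ][o.e.s.SnapshotsService   ] [REDACTED] snapshot [REDACTED] completed with state [SUCCESS]"

def log_timestamp(line: str) -> datetime:
    """Parse the leading [yyyy-mm-ddTHH:MM:SS,mmm] field of a log line."""
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S,%f")

elapsed = log_timestamp(COMPLETED) - log_timestamp(STARTED)
print(f"{elapsed.total_seconds():.1f} s")  # 227.8 s, about 3.8 minutes
```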

I believe we have applied the setting correctly. We are using Data Domain as the repository.

In which case there's something wrong with your logging setup, since the logs you shared contain no trace information at all.

As in the Dell EMC product? If so, we've definitely encountered performance issues with them in the past. Elasticsearch's access pattern did change between 6.8 and 7.10 and the new access pattern may not be very well-supported in the repository you're using. I would recommend running tests against real S3 to rule out the nonstandard repository as the source of problems first.

I will recheck the log settings to ensure that they are correct. Yes, this is a Dell EMC product and we are from the same division as the Data Domain product. This is an enterprise product and I made changes to write it to S3.

Where can I learn about the changes to Elastic's access pattern from 6.8 to 7.10? I would appreciate any information on that. Thank you.

You will be able to see the access pattern in the trace logs, and I believe you can log each access on the Data Domain side too. The details of the changes aren't really documented anywhere public, they're very much an implementation detail, but the main change I'm aware of is the heavier use of concurrent uploads. This gives a big performance boost on proper S3 buckets but I could believe that some third-party repositories might perform worse under a more concurrent workload.
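
One crude way to check whether a given storage target copes worse with concurrent writers is to time the same set of file writes with one thread and then with several. The Python sketch below does that. It is not Elasticsearch's actual access pattern, only a rough probe; the file counts, sizes, and function names are arbitrary choices for illustration, and the temporary directory would need to be replaced with a directory on the repository mount to test real storage.

```python
import os
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

def write_blob(path: str, payload: bytes) -> None:
    """Write one file and force it to stable storage before returning."""
    with open(path, "wb") as handle:
        handle.write(payload)
        handle.flush()
        os.fsync(handle.fileno())

def time_writes(directory: str, count: int, payload: bytes, workers: int) -> float:
    """Seconds taken to write `count` files using `workers` threads."""
    paths = [os.path.join(directory, f"blob-{workers}-{i}") for i in range(count)]
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so any write error is raised here.
        list(pool.map(lambda p: write_blob(p, payload), paths))
    elapsed = time.perf_counter() - started
    for path in paths:
        os.remove(path)  # clean up so repeated runs start from an empty state
    return elapsed

if __name__ == "__main__":
    payload = os.urandom(1024 * 1024)  # 1 MiB of random data per file
    # Swap the temp dir for a directory on the repository mount to test it.
    with tempfile.TemporaryDirectory() as target:
        for workers in (1, 4, 8):
            seconds = time_writes(target, count=16, payload=payload, workers=workers)
            print(f"{workers} writer(s): {seconds:.2f} s")
```

If the multi-writer runs are no faster, or are slower, than the single-writer run on the repository mount, that would be consistent with the storage handling concurrent uploads poorly, though it would not prove it.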