Hi all:
We have upgraded to ES 7.10 from 6.8. After the upgrade, we are observing that snapshots on 7.10 take around 3 times longer than they did on 6.8. These are snapshots of the ES cluster to an NFS-mounted share. There is no change in the configuration from 6.8 to 7.10. I was wondering if anyone else has seen this behavior. If so, are there any settings that need to be configured in ES 7.10 to get back the previous performance?
I'm not aware of any performance regression in this area. Can you quantify it in more absolute terms? How large/complex a snapshot are you taking (data volume, number of shards, any other pertinent info) and how long does it take?
Also note that snapshots will re-use previously-snapshotted data where possible, so if the repository wasn't empty then that will confound your measurements too.
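If in doubt, you can check what is already in the repository before you start comparing. For example, with a repository named my_fs_repo (a placeholder name), an empty repository returns an empty snapshot list:

GET _snapshot/my_fs_repo/_all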
The repository was empty, as we were trying to compare the performance of 6.8 vs 7.10.
Also, there are 3 indices and only one of them is big, around 6.2 GB. The others are a few MB or smaller.
Thanks. Yes, 10 minutes does seem longer than expected to make a ~6GB snapshot. The only relevant setting I can think of is max_snapshot_bytes_per_sec which defaults to 40mb, i.e. 40MB/s, but of course that applies to both versions. There have been quite a lot of changes to how snapshots work between 6.8 and 7.10 but (as I said) I'm not aware of any that would cause such a performance drop.
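For reference, that rate limit is a per-repository setting, applied when the repository is registered (or re-registered with updated settings). A rough sketch for an fs repository, with placeholder repository name and path:

PUT _snapshot/my_fs_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/nfs/es-snapshots",
    "max_snapshot_bytes_per_sec": "200mb"
  }
}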
Unless anyone else has better ideas I think you'll need to share some logs from every node, with these settings:
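For illustration, the logger config would look something like this, applied through the cluster settings API; the snapshot and repository packages below are the usual candidates, though the exact list may vary:

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.snapshots": "TRACE",
    "logger.org.elasticsearch.repositories": "TRACE"
  }
}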
I don't think you correctly applied the logger config I mentioned above. The logs only contain this:
[2021-02-03T14:53:54,566][INFO ][o.e.r.RepositoriesService] [REDACTED] put repository [REDACTED]
[2021-02-03T14:53:54,803][INFO ][o.e.s.SnapshotsService ] [REDACTED] snapshot [REDACTED] started
[2021-02-03T14:57:42,611][INFO ][o.e.s.SnapshotsService ] [REDACTED] snapshot [REDACTED] completed with state [SUCCESS]
There should be a good number of additional trace statements in between the "started" and "completed" lines.
Also, what exactly is the repository you're using? Is it an S3 bucket or is it something supposedly "S3-compatible"?
Also also, the timestamps indicate that this snapshot took a little under 4 minutes to complete, much less than the 10 minutes you originally reported. Does this mean the problem is fixed?
In which case there's something wrong with your logging setup, since the logs you shared contain no trace information at all.
As in the Dell EMC product? If so, we've definitely encountered performance issues with them in the past. Elasticsearch's access pattern did change between 6.8 and 7.10 and the new access pattern may not be very well-supported in the repository you're using. I would recommend running tests against real S3 to rule out the nonstandard repository as the source of problems first.
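To test against real S3 you would register an S3 repository (this needs the repository-s3 plugin on 7.10; the repository and bucket names below are placeholders) and take the same snapshot against it, roughly:

PUT _snapshot/my_s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-test-bucket"
  }
}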
I will recheck the log settings to ensure that they are correct. Yes, this is a Dell EMC product and we are from the same division as the Data Domain product. This is an enterprise product and I made changes to write it to S3.
Where can I learn about the changes to Elasticsearch's access pattern from 6.8 to 7.10? I'd appreciate any information on that. Thank you.
You will be able to see the access pattern in the trace logs, and I believe you can log each access on the Data Domain side too. The details of the changes aren't really documented anywhere public, they're very much an implementation detail, but the main change I'm aware of is the heavier use of concurrent uploads. This gives a big performance boost on proper S3 buckets but I could believe that some third-party repositories might perform worse under a more concurrent workload.
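One rough way to see that concurrency in action is to watch the snapshot thread pool on each node while the snapshot is running, for example:

GET _cat/thread_pool/snapshot?v&h=node_name,name,active,queue,rejected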