Hi all:
We have upgraded to ES 7.10 from 6.8. After the upgrade, we are observing that snapshots on 7.10 take around 3 times longer than they did on 6.8. These are snapshots of the ES cluster to an NFS-mounted share. There is no change in the configuration from 6.8 to 7.10. I was wondering if anyone else has seen this behavior. If so, are there any settings that need to be configured in ES 7.10 to get back the previous performance?
I'm not aware of any performance regression in this area. Can you quantify it in more absolute terms? How large/complex a snapshot are you taking (data volume, number of shards, any other pertinent info) and how long does it take?
Also note that snapshots will re-use previously-snapshotted data where possible, so if the repository wasn't empty then that will confound your measurements too.
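If in doubt, you can check what is already in the repository before you start comparing. For example, with a repository named my_fs_repo (a placeholder name), an empty repository returns an empty snapshot list:

GET _snapshot/my_fs_repo/_all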
The repository was empty, as we were trying to compare the performance of 6.8 vs 7.10.
Also, there are 3 indices and only one of them is big, around 6.2 GB. The others are a few MB or smaller.
Thanks. Yes, 10 minutes does seem longer than expected to make a ~6GB snapshot. The only relevant setting I can think of is max_snapshot_bytes_per_sec which defaults to 40mb, i.e. 40MB/s, but of course that applies to both versions. There have been quite a lot of changes to how snapshots work between 6.8 and 7.10 but (as I said) I'm not aware of any that would cause such a performance drop.
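For reference, that rate limit is a per-repository setting, applied when the repository is registered (or re-registered with updated settings). A rough sketch for an fs repository, with placeholder repository name and path:

PUT _snapshot/my_fs_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/nfs/es-snapshots",
    "max_snapshot_bytes_per_sec": "200mb"
  }
}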
Unless anyone else has better ideas I think you'll need to share some logs from every node, with these settings:
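For illustration, the logger config would look something like this, applied through the cluster settings API; the snapshot and repository packages below are the usual candidates, though the exact list may vary:

PUT _cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.snapshots": "TRACE",
    "logger.org.elasticsearch.repositories": "TRACE"
  }
}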
I don't think you correctly applied the logger config I mentioned above. The logs only contain this:
[2021-02-03T14:53:54,566][INFO ][o.e.r.RepositoriesService] [REDACTED] put repository [REDACTED]
[2021-02-03T14:53:54,803][INFO ][o.e.s.SnapshotsService ] [REDACTED] snapshot [REDACTED] started
[2021-02-03T14:57:42,611][INFO ][o.e.s.SnapshotsService ] [REDACTED] snapshot [REDACTED] completed with state [SUCCESS]
There should be a good number of additional trace statements in between the "started" and "completed" lines.
Also, what exactly is the repository you're using? Is it an S3 bucket or is it something supposedly "S3-compatible"?
Also also, the timestamps indicate that this snapshot took a little under 4 minutes to complete, much less than the 10 minutes you originally reported. Does this mean the problem is fixed?
In which case there's something wrong with your logging setup, since the logs you shared contain no trace information at all.
As in the Dell EMC product? If so, we've definitely encountered performance issues with them in the past. Elasticsearch's access pattern did change between 6.8 and 7.10 and the new access pattern may not be very well-supported in the repository you're using. I would recommend running tests against real S3 to rule out the nonstandard repository as the source of problems first.
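To test against real S3 you would register an S3 repository (this needs the repository-s3 plugin on 7.10; the repository and bucket names below are placeholders) and take the same snapshot against it, roughly:

PUT _snapshot/my_s3_repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-test-bucket"
  }
}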
I will recheck the log settings to ensure that they are correct. Yes, this is a Dell EMC product and we are from the same division as the Data Domain product. This is an enterprise product and I made changes to write it to S3.
Where can I learn about the changes to Elasticsearch's access pattern from 6.8 to 7.10? I'd appreciate any information on that. Thank you.
You will be able to see the access pattern in the trace logs, and I believe you can log each access on the Data Domain side too. The details of the changes aren't really documented anywhere public, they're very much an implementation detail, but the main change I'm aware of is the heavier use of concurrent uploads. This gives a big performance boost on proper S3 buckets but I could believe that some third-party repositories might perform worse under a more concurrent workload.
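One rough way to see that concurrency in action is to watch the snapshot thread pool on each node while the snapshot is running, for example:

GET _cat/thread_pool/snapshot?v&h=node_name,name,active,queue,rejected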