Repository-gcs: high number of requests

A month ago I followed the repository-gcs guide to set up a snapshot repository on GCS: https://www.elastic.co/guide/en/elasticsearch/plugins/7.5/repository-gcs.html

POST http://elasticsearch-cluster:9200/_snapshot/daily
{
  "type": "gcs",
  "settings": {
    "bucket": "my-bucket",
    "base_path": "es-backup-static"
  }
}

Everything worked fine from the beginning of June. We create a backup of our Elasticsearch cluster every 6 hours, using the corresponding timestamp as the snapshot name.

PUT "http://elasticsearch-cluster:9200/_snapshot/daily/2020-07-18-12-00-00"
 { "indices": "index-ioc,index-file",
"include_global_state": false}

Since last week we have been facing an issue that leads to high load on the GCS bucket. When we enable the daily snapshot repository, the number of requests per second rises to around 500. After removing the repository, the number of requests goes down again.


Of course GCS does not allow such a high number of requests, so we get HTTP 429 errors. Currently it is not even possible to enable the repository and fetch the snapshots with:

GET http://elasticsearch-cluster:9200/_cat/snapshots/daily

Response:

{"error":{"root_cause":[{"type":"snapshot_exception","reason":"
[tmp:2020-06-28-00-01-03/Tkf58STgRNun5P_hAaIo0Q] Snapshot could not be read"}],"type":"snapshot_exception","reason":"[tmp:2020-06-28-00-01-03/Tkf58STgRNun5P_hAaIo0Q] Snapshot could not be read","caused_by":{"type":"storage_exception","reason":"429 Too Many Requests\nThe project exceeded the rate limit for CloudKMS decrypt operations.","caused_by":{"type":"google_json_response_exception","reason":"429 Too Many Requests\nThe project exceeded the rate limit for CloudKMS decrypt operations."}}},"status":500}

In the logs we then see the following messages:

es-1 elasticsearch {"type": "server", "timestamp": "2020-07-20T13:35:27,511Z", "level": "WARN", "component": "r.suppressed", "cluster.name": "es", "node.name": "es-1", "message": "path: /_snapshot/daily/_all, params: {repository=daily, snapshot=_all}", "cluster.uuid": "dGPxO0YdQDioz1rvpigPfg", "node.id": "nF6-bsDdRxqkcpYBnYvv8g" , 
    es-1 elasticsearch "stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [es-0][10.22.20.83:9300][cluster:admin/snapshot/get]",
    es-1 elasticsearch "Caused by: org.elasticsearch.snapshots.SnapshotException: [daily:2020-07-06-06-00-09/C3WiyagzQyqNqKQ_XF7dYQ] Snapshot could not be read",
    es-1 elasticsearch "at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:216) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:136) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:56) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:98) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:169) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]",
    es-1 elasticsearch "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]",
    es-1 elasticsearch "at java.lang.Thread.run(Thread.java:830) [?:?]",
    es-1 elasticsearch "Caused by: org.elasticsearch.common.io.stream.NotSerializableExceptionWrapper: storage_exception: 429 Too Many Requests",
    es-1 elasticsearch "The project exceeded the rate limit for CloudKMS decrypt operations.",
    es-1 elasticsearch "at com.google.cloud.storage.spi.v1.HttpStorageRpc.translate(HttpStorageRpc.java:227) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.storage.spi.v1.HttpStorageRpc.read(HttpStorageRpc.java:690) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.storage.BlobReadChannel$1.call(BlobReadChannel.java:127) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.storage.BlobReadChannel$1.call(BlobReadChannel.java:124) ~[?:?]",
    es-1 elasticsearch "at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.RetryHelper.run(RetryHelper.java:76) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.storage.BlobReadChannel.read(BlobReadChannel.java:123) ~[?:?]",
    es-1 elasticsearch "at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1.lambda$read$0(GoogleCloudStorageBlobStore.java:189) ~[?:?]",
    es-1 elasticsearch "at java.security.AccessController.doPrivileged(AccessController.java:554) ~[?:?]",
    es-1 elasticsearch "at org.elasticsearch.repositories.gcs.SocketAccess.doPrivilegedIOException(SocketAccess.java:44) ~[?:?]",
    es-1 elasticsearch "at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1.read(GoogleCloudStorageBlobStore.java:189) ~[?:?]",
    es-1 elasticsearch "at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) ~[?:?]",
    es-1 elasticsearch "at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) ~[?:?]",
    es-1 elasticsearch "at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[?:?]",
    es-1 elasticsearch "at java.io.InputStream.read(InputStream.java:213) ~[?:?]",
    es-1 elasticsearch "at org.elasticsearch.common.io.Streams.doCopy(Streams.java:86) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.common.io.Streams.copy(Streams.java:79) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.common.io.Streams.copy(Streams.java:61) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.common.io.Streams.readFully(Streams.java:252) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.readBlob(ChecksumBlobStoreFormat.java:135) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.repositories.blobstore.ChecksumBlobStoreFormat.read(ChecksumBlobStoreFormat.java:114) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getSnapshotInfo(BlobStoreRepository.java:754) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.snapshots.SnapshotsService.snapshots(SnapshotsService.java:208) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:136) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.action.admin.cluster.snapshots.get.TransportGetSnapshotsAction.masterOperation(TransportGetSnapshotsAction.java:56) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.action.support.master.TransportMasterNodeAction.masterOperation(TransportMasterNodeAction.java:98) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$3(TransportMasterNodeAction.java:169) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:73) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.1.jar:7.5.1]",
    es-1 elasticsearch "at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]",
    es-1 elasticsearch "at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]",
    es-1 elasticsearch "at java.lang.Thread.run(Thread.java:830) ~[?:?]",
    es-1 elasticsearch "Caused by: java.io.IOException: 429 Too Many Requests",
    es-1 elasticsearch "The project exceeded the rate limit for CloudKMS decrypt operations.",
    es-1 elasticsearch "at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150) ~[?:?]",
    es-1 elasticsearch "at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113) ~[?:?]",
    es-1 elasticsearch "at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40) ~[?:?]",
    es-1 elasticsearch "at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:417) ~[?:?]",
    es-1 elasticsearch "at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1089) ~[?:?]",
    es-1 elasticsearch "at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:515) ~[?:?]",
    es-1 elasticsearch "at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:448) ~[?:?]",
    es-1 elasticsearch "at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:476) ~[?:?]",
    es-1 elasticsearch "at com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:6981) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.storage.spi.v1.HttpStorageRpc.read(HttpStorageRpc.java:668) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.storage.BlobReadChannel$1.call(BlobReadChannel.java:127) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.storage.BlobReadChannel$1.call(BlobReadChannel.java:124) ~[?:?]",
    es-1 elasticsearch "at com.google.api.gax.retrying.DirectRetryingExecutor.submit(DirectRetryingExecutor.java:105) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.RetryHelper.run(RetryHelper.java:76) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.RetryHelper.runWithRetries(RetryHelper.java:50) ~[?:?]",
    es-1 elasticsearch "at com.google.cloud.storage.BlobReadChannel.read(BlobReadChannel.java:123) ~[?:?]",
    es-1 elasticsearch "at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1.lambda$read$0(GoogleCloudStorageBlobStore.java:189) ~[?:?]",
    es-1 elasticsearch "at java.security.AccessController.doPrivileged(AccessController.java:554) ~[?:?]",
    es-1 elasticsearch "at org.elasticsearch.repositories.gcs.SocketAccess.doPrivilegedIOException(SocketAccess.java:44) ~[?:?]",
    es-1 elasticsearch "at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1.read(GoogleCloudStorageBlobStore.java:189) ~[?:?]",
    es-1 elasticsearch "at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:65) ~[?:?]",
    es-1 elasticsearch "at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109) ~[?:?]",
    es-1 elasticsearch "at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:103) ~[?:?]",
    es-1 elasticsearch "at java.io.InputStream.read(InputStream.java:213) ~[?:?]",
    es-1 elasticsearch "at org.elasticsearch.common.io.Streams.doCopy(Streams.java:86) ~[elasticsearch-7.5.1.jar:7.5.1]",
...

Has anyone had this issue before? How can I stop these requests from constantly spiking? I have no idea why it worked for the last month but no longer does, even though nothing was changed.

Environment:
GKE 1.16
Google Cloud Storage
Elasticsearch v7.5.1
3-node cluster

How long are you keeping the snapshots?

Currently we keep our snapshots for 21 days. Is there an upper limit on the number of snapshots that is supported?
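
For reference, removing an expired snapshot only takes a single delete call; the snapshot name below is just an example in our timestamp format:

DELETE http://elasticsearch-cluster:9200/_snapshot/daily/2020-06-27-00-00-00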

Hi @chaekew2xie1ohh2aoYe

It seems you are simply hitting the rate limit GCS allows for KMS operations. It could be that the data in your cluster changed in a way that causes the snapshot to require slightly more uploads, for example, or you simply increased the size of your cluster (more shards, etc.).

The KMS docs have information at the bottom on how to raise the quota for these operations. Increasing it should fix things for you.

Hi @Armin_Braun

Thanks for your reply. I checked out this post and it seems clear that with more shards, more data, etc. the quota will be exceeded, which is indeed what happens. But from my point of view it should only exceed this quota during a snapshot, not when I create the repository, or am I missing something?

I did a further check: I have an Elasticsearch image without the repository-gcs plugin installed and an image with the plugin. When I switch to the image with the plugin, the requests already increase rapidly, without my even querying anything manually.

Right, you should only see a higher request rate during snapshot creation. Creating a repository should only take a trivial number of requests in a 3-node cluster (literally ~10).
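
If you want to double-check that yourself, you can re-run the repository verification manually; as far as I recall it only makes each node write and read a tiny test blob in the bucket:

POST http://elasticsearch-cluster:9200/_snapshot/daily/_verify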

This is suspicious. The repository does not do any background work at all. Unless you are actively initiating an operation (create snapshot, list snapshots, ...) you should not see any requests from ES to GCS at all. But the way I understand it, you are seeing a large number of constant requests as soon as the repository exists?
This would likely be the result of some other process constantly hitting your repository/snapshot API(s). One easy way of tracking down what's causing this might be to go through your warning logs and check whether you can find one that contains TransportXXXAction for some XXX. If you find things like TransportSnapshotsStatusAction or TransportGetSnapshotsAction, then I would look into where those requests are coming from.

Another option to track this down (given the large number and constant nature of these requests) would be to check the tasks API to see whether you can find any snapshot-status-related tasks there.
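
Something along these lines should show whether snapshot-related tasks are constantly being created (the actions filter is just an example, adjust it as needed):

GET http://elasticsearch-cluster:9200/_tasks?detailed=true&actions=cluster:admin/snapshot/*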

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.