GCS repository creation times out on one node (starts working after restart)

sumant-pangotra · May 27, 2024, 12:46pm

{
  "name" : "elasticsearch-client-6fdc44747f-2h2md",
  "cluster_name" : "es-123",
  "cluster_uuid" : "dfQbgOeVTXieyW3JUWcEQw",
  "version" : {
    "number" : "6.3.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "424e937",
    "build_date" : "2018-06-11T23:38:03.357887Z",
    "build_snapshot" : false,
    "lucene_version" : "7.3.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

While creating the repository, the following exception is observed:

[2024-05-27T11:49:49,327][WARN ][o.e.r.RepositoriesService] [elasticsearch-sc-mdata-2] failed to create repository [test-gc]
org.elasticsearch.repositories.RepositoryException: [test-gc] failed to create repository
	at org.elasticsearch.repositories.RepositoriesService.createRepository(RepositoriesService.java:388) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.repositories.RepositoriesService.applyClusterState(RepositoriesService.java:303) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:496) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.lang.Iterable.forEach(Iterable.java:75) [?:1.8.0_151]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:493) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:480) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:431) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.elasticsearch.common.blobstore.BlobStoreException: Unable to check if bucket [elasticsearch-backup] exists
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.doesBucketExist(GoogleCloudStorageBlobStore.java:118) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.<init>(GoogleCloudStorageBlobStore.java:75) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageRepository.<init>(GoogleCloudStorageRepository.java:137) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin.lambda$getRepositories$1(GoogleCloudStoragePlugin.java:129) ~[?:?]
	at org.elasticsearch.repositories.RepositoriesService.createRepository(RepositoriesService.java:383) ~[elasticsearch-6.3.0.jar:6.3.0]
	... 13 more
Caused by: java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_151]
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_151]
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_151]
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_151]
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_151]
	at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_151]
	at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:673) ~[?:?]
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175) ~[?:?]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) ~[?:?]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ~[?:?]
	at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264) ~[?:?]
	at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367) ~[?:?]
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050) ~[?:?]
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177) ~[?:?]
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162) ~[?:?]
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:104) ~[?:?]
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981) ~[?:?]
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) ~[?:?]
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) ~[?:?]
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.lambda$doesBucketExist$0(GoogleCloudStorageBlobStore.java:104) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_151]
	at org.elasticsearch.repositories.gcs.SocketAccess.doPrivilegedIOException(SocketAccess.java:44) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.doesBucketExist(GoogleCloudStorageBlobStore.java:102) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.<init>(GoogleCloudStorageBlobStore.java:75) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageRepository.<init>(GoogleCloudStorageRepository.java:137) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin.lambda$getRepositories$1(GoogleCloudStoragePlugin.java:129) ~[?:?]
	at org.elasticsearch.repositories.RepositoriesService.createRepository(RepositoriesService.java:383) ~[elasticsearch-6.3.0.jar:6.3.0]
	... 13 more

Repository creation and snapshots had been working on the same node for many days before this.

sumant-pangotra · May 28, 2024, 6:34am

More information

Of multiple nodes, this issue is seen on only some of them, resulting in partial snapshots.
Google Support says:

I investigated the issue that you are encountering on Google Cloud Storage side (such as permission errors, connection aborted and broken connection error, latency problems etc.) but could not spot any issues. Looks like the error occurs before even reaching the GCS bucket.

dadoonet · May 29, 2024, 4:58am

This version was released 6 years ago and you have not even updated to 6.8. You should definitely switch to a more robust version like 7.17 or, better, 8.13.

sumant-pangotra · May 29, 2024, 5:59am

Hi @dadoonet

Thank you for your reply. An upgrade is in the pipeline, but it will take some time. I would appreciate any help in understanding this problem.

The version may be an issue, but the problem started in 20+ different clusters (hosted across different regions and GCP projects, each with its own backup bucket) at almost the same time, which makes it difficult to understand.

dadoonet · May 29, 2024, 6:27am

As it's apparently a networking issue, I'd double-check that part. Maybe there's a firewall somewhere?
I'd try to access the buckets from the CLI on one of the nodes, just to check what is happening.
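
One way to make that check more specific to the `connect timed out` in the trace is to try a plain TCP connect to every address the endpoint resolves to, rather than whichever one a CLI tool happens to pick. The following is only a rough sketch: `probe_endpoint` is a hypothetical helper, not part of Elasticsearch or any Google tooling, and the hostname to test is an assumption because the thread does not say which endpoint the plugin calls. It also runs outside the Elasticsearch JVM, so it cannot show what the JVM itself has cached.

```python
import socket


def probe_endpoint(host, port=443, timeout=5.0):
    """Attempt a TCP connect to every address that `host` resolves to.

    Returns a list of (ip, connected, detail) tuples, one per distinct address.
    """
    results = []
    seen = set()
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for family, socktype, proto, _canonname, sockaddr in infos:
        if sockaddr in seen:
            continue  # getaddrinfo can return the same address more than once
        seen.add(sockaddr)
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
            results.append((sockaddr[0], True, "connected"))
        except OSError as exc:
            # Timeouts and refused connections are both OSError subclasses.
            results.append((sockaddr[0], False, f"{type(exc).__name__}: {exc}"))
        finally:
            sock.close()
    return results
```

Running something like `probe_endpoint("storage.googleapis.com")` on an affected node and on a healthy one (the hostname is an assumption, adjust it to whatever your plugin version uses) would show whether a specific address is unreachable from one node only. If every address connects, the problem is more likely inside the running JVM than on the network path.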

sumant-pangotra · May 29, 2024, 7:07am

I tried curl directly against the bucket from the affected node and it works as expected.
Also, of the three data nodes only one has this problem; the other two connect and write their data, which results in partial snapshots.

If we restart the affected node, it starts working
I suspect it had cached some information at the node level, maybe an IP address or some certificates.

dadoonet · May 29, 2024, 12:39pm

Yeah. And as it's an old JVM as well, it might be missing root certificates or something like that...

Glad you solved it.

sumant-pangotra · May 29, 2024, 12:42pm

But in that case a restart would not solve the problem.

sumant-pangotra · June 3, 2024, 7:24am

Hi @dadoonet
Can we discuss some more possibilities? I would appreciate your help.

dadoonet · June 3, 2024, 8:58am

What else do you need?

You wrote:

If we restart the affected node, it starts working

So I guess that all is good now.

sumant-pangotra · June 5, 2024, 9:11am

This is the second time in the last 6 months that the issue has happened across 20+ clusters.
Restarting that many nodes becomes a major task.
I want to know whether we can somehow prevent this from happening, and to get there we need to figure out why it is happening.

dadoonet · June 5, 2024, 9:44am

Yes. As I said earlier:

This version was released 6 years ago and you have not even updated to 6.8. You should definitely switch to a more robust version like 7.17 or, better, 8.13.
