GCS repository creation times out on one node (starts working after restart)

{
  "name" : "elasticsearch-client-6fdc44747f-2h2md",
  "cluster_name" : "es-123",
  "cluster_uuid" : "dfQbgOeVTXieyW3JUWcEQw",
  "version" : {
    "number" : "6.3.0",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "424e937",
    "build_date" : "2018-06-11T23:38:03.357887Z",
    "build_snapshot" : false,
    "lucene_version" : "7.3.1",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}

While creating the repository, the following exception is observed:

[2024-05-27T11:49:49,327][WARN ][o.e.r.RepositoriesService] [elasticsearch-sc-mdata-2] failed to create repository [test-gc]
org.elasticsearch.repositories.RepositoryException: [test-gc] failed to create repository
	at org.elasticsearch.repositories.RepositoriesService.createRepository(RepositoriesService.java:388) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.repositories.RepositoriesService.applyClusterState(RepositoriesService.java:303) ~[elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:496) ~[elasticsearch-6.3.0.jar:6.3.0]
	at java.lang.Iterable.forEach(Iterable.java:75) [?:1.8.0_151]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:493) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:480) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:431) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:161) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:625) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244) [elasticsearch-6.3.0.jar:6.3.0]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207) [elasticsearch-6.3.0.jar:6.3.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.elasticsearch.common.blobstore.BlobStoreException: Unable to check if bucket [elasticsearch-backup] exists
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.doesBucketExist(GoogleCloudStorageBlobStore.java:118) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.<init>(GoogleCloudStorageBlobStore.java:75) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageRepository.<init>(GoogleCloudStorageRepository.java:137) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin.lambda$getRepositories$1(GoogleCloudStoragePlugin.java:129) ~[?:?]
	at org.elasticsearch.repositories.RepositoriesService.createRepository(RepositoriesService.java:383) ~[elasticsearch-6.3.0.jar:6.3.0]
	... 13 more
Caused by: java.net.SocketTimeoutException: connect timed out
	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:1.8.0_151]
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350) ~[?:1.8.0_151]
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206) ~[?:1.8.0_151]
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188) ~[?:1.8.0_151]
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) ~[?:1.8.0_151]
	at java.net.Socket.connect(Socket.java:589) ~[?:1.8.0_151]
	at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:673) ~[?:?]
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175) ~[?:?]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:463) ~[?:?]
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:558) ~[?:?]
	at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264) ~[?:?]
	at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367) ~[?:?]
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156) ~[?:?]
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050) ~[?:?]
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177) ~[?:?]
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162) ~[?:?]
	at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:104) ~[?:?]
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:981) ~[?:?]
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419) ~[?:?]
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352) ~[?:?]
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.lambda$doesBucketExist$0(GoogleCloudStorageBlobStore.java:104) ~[?:?]
	at java.security.AccessController.doPrivileged(Native Method) ~[?:1.8.0_151]
	at org.elasticsearch.repositories.gcs.SocketAccess.doPrivilegedIOException(SocketAccess.java:44) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.doesBucketExist(GoogleCloudStorageBlobStore.java:102) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.<init>(GoogleCloudStorageBlobStore.java:75) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageRepository.<init>(GoogleCloudStorageRepository.java:137) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStoragePlugin.lambda$getRepositories$1(GoogleCloudStoragePlugin.java:129) ~[?:?]
	at org.elasticsearch.repositories.RepositoriesService.createRepository(RepositoriesService.java:383) ~[elasticsearch-6.3.0.jar:6.3.0]
	... 13 more

Repository creation and snapshots had been working on the same node for many days before this.

More information

Of the multiple nodes, this issue is seen only on some, resulting in partial snapshots.
Google Support says:

I investigated the issue that you are encountering on Google Cloud Storage side (such as permission errors, connection aborted and broken connection error, latency problems etc.) but could not spot any issues. Looks like the error occurs before even reaching the GCS bucket.

This version was released 6 years ago and you did not even update to 6.8. You should definitely switch to a more robust version like 7.17 or, better, 8.13.

Hi @dadoonet

Thank you for your reply. An upgrade is in the pipeline but it will take some time. I'd appreciate any help in understanding this problem.

The version may be an issue, but the problem started in 20+ different clusters (hosted across different regions and GCP projects, each with its own backup bucket) at almost the same time, which makes it difficult to understand.

As it's apparently a networking issue, I'd double-check that part. Maybe there's a firewall somewhere?
I'd try to access the buckets from the CLI on one of the nodes, just to check what is happening.

I tried curl directly against the bucket from the affected node and it works as expected.
Also, of the three data nodes only one has this problem. The other two nodes connect and write their data, which results in partial snapshots.
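
Since curl uses its own DNS lookup and TLS stack, a successful curl does not necessarily show what the JVM sees. A minimal sketch of a JVM-side check follows: it resolves a host the way Java does and attempts a plain TCP connect to each returned address. The class name `GcsConnectProbe`, the default host `www.googleapis.com`, port 443 and the 5 second timeout are assumptions for illustration, not values taken from the cluster configuration. It also runs in a fresh JVM, so it cannot reproduce any state held inside the long-running Elasticsearch process.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;

public class GcsConnectProbe {

    // Returns true if a TCP connection to address:port succeeds within timeoutMillis.
    static boolean canConnect(InetAddress address, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(address, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers SocketTimeoutException ("connect timed out") and refused connections.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumed endpoint; pass the real API host as the first argument if it differs.
        String host = args.length > 0 ? args[0] : "www.googleapis.com";
        for (InetAddress address : InetAddress.getAllByName(host)) {
            String result = canConnect(address, 443, 5000) ? "connected" : "FAILED";
            System.out.println(address.getHostAddress() + " -> " + result);
        }
    }
}
```

If some of the resolved addresses fail while others connect, that would point at routing or firewalling for specific IPs rather than at the bucket itself.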

If we restart the affected node, it starts working.
I suspect the node had cached some information, maybe an IP address or some certificates.
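
One way to look into the cached-IP suspicion is to check how long the JVM keeps DNS results. The sketch below only prints the standard `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` security properties for the JVM it runs in; the class name `DnsCacheSettings` is made up for illustration. As far as I know, Elasticsearch applies its own values through the `es.networkaddress.cache.ttl` and `es.networkaddress.cache.negative.ttl` system properties in `jvm.options`, so the effective values inside the node may differ from what this prints.

```java
import java.security.Security;

public class DnsCacheSettings {

    // Formats one JVM security property, noting when it is not set explicitly.
    static String describe(String property) {
        String value = Security.getProperty(property);
        return property + " = " + (value == null ? "(unset, JVM default applies)" : value);
    }

    public static void main(String[] args) {
        // Seconds to cache successful lookups; -1 means cache forever.
        System.out.println(describe("networkaddress.cache.ttl"));
        // Seconds to cache failed lookups.
        System.out.println(describe("networkaddress.cache.negative.ttl"));
    }
}
```

If positive lookups turn out to be cached for a very long time, a node could keep connecting to an address that is no longer reachable until it is restarted, which would match the symptom. This is a hypothesis to check, not a confirmed cause.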

Yeah. And as it's an old JVM as well, it might be missing root certificates or something like that...

Glad you solved it.

But in that case a restart would not solve the problem.

Hi @dadoonet
Can we discuss some more possibilities? Would appreciate your help

What else do you need?

You wrote:

If we restart the affected node, it starts working

So I guess that all is good now.

This is the second time in the last 6 months that the issue has happened across 20+ clusters.
Restarting that many nodes becomes a major task.
I want to know whether we can somehow prevent this from happening, and to get there we need to figure out why it is happening.

Yes. As I said earlier:

This version was released 6 years ago and you did not even update to 6.8. You should definitely switch to a more robust version like 7.17 or, better, 8.13.
