Elastic-operator requests regularly time out

Hi,

I'm using ECK on GKE. Almost every time I make any modification (initial apply, edit) to the Elasticsearch object, the request times out. But only barely! If I increase the timeout, the command usually succeeds in 30.1s. For example, here's the output of kubectl edit es elasticsearch-eck --request-timeout=60s:

I0212 15:26:57.363492 68694 round_trippers.go:438] PUT https://<master>/apis/elasticsearch.k8s.elastic.co/v1/namespaces/default/elasticsearches/elasticsearch-eck?timeout=1m0s 200 OK in 30093 milliseconds

This isn't so bad locally, because I can easily bump the timeout I pass to kubectl. However, the same problem affects the elastic-operator, which hits these timeouts as well.

For example, here I'm trying to reduce the number of nodes, but the operator is stuck, repeatedly failing to update the annotation for minimum_master_nodes.

...
E 2020-02-12T18:45:51.190380453Z Updating minimum master nodes 
I 2020-02-12T18:45:51.203436Z Request Body:  <trimmed>
I 2020-02-12T18:45:51.203671Z curl -k -v -XPUT  -H "Accept: application/json, */*" -H "Content-Type: application/json" -H "User-Agent: elastic-operator/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer  <trimmed>" 'https://10.0.16.1:443/apis/elasticsearch.k8s.elastic.co/v1/namespaces/default/elasticsearches/elasticsearch-eck' 
E 2020-02-12T18:45:51.505008214Z Retrieving cluster state 
E 2020-02-12T18:46:01.504865401Z Retrieving cluster state 
E 2020-02-12T18:46:11.505025417Z Retrieving cluster state 
I 2020-02-12T18:46:21.206976Z PUT https://10.0.16.1:443/apis/elasticsearch.k8s.elastic.co/v1/namespaces/default/elasticsearches/elasticsearch-eck 504 Gateway Timeout in 30003 milliseconds 
I 2020-02-12T18:46:21.207016Z Response Headers: 
I 2020-02-12T18:46:21.207022Z     Audit-Id: a9ab3f93-c36c-4c28-a224-f97a03001822 
I 2020-02-12T18:46:21.207027Z     Content-Type: application/json 
I 2020-02-12T18:46:21.207030Z     Content-Length: 187 
I 2020-02-12T18:46:21.207034Z     Date: Wed, 12 Feb 2020 18:46:21 GMT 
I 2020-02-12T18:46:21.207082Z Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Timeout: request did not complete within requested timeout 30s","reason":"Timeout","details":{},"code":504} 
E 2020-02-12T18:46:21.207784068Z Ending reconciliation run 
E 2020-02-12T18:46:21.207793754Z Reconciler error 
...

This repeats for many hours and only occasionally, magically, gets through and makes progress.

Is there a way to bump this timeout? Or is there something else to look into regarding why these operations seem to take exactly the wrong amount of time?

I don't think there's anything terribly fancy in the config:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-eck
spec:
  version: 6.8.6
  nodeSets:
  - name: default
    count: 3
    config:
      node.master: true
      node.data: true
      node.ingest: true
      processors: 8
      reindex.remote.whitelist: "*:9200"
      thread_pool.index.queue_size: 500
      thread_pool.write.queue_size: 500
      xpack.security.authc:
        anonymous:
          username: anonymous
          roles: superuser
          authz_exception: false
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 44Gi
            limits:
              memory: 44Gi
          env:
          - name: ES_JAVA_OPTS
            value: -Xmx12g -Xms12g -XX:-UseParallelGC -XX:-UseConcMarkSweepGC -XX:+UseG1GC
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: standard
  http:
    service:
      metadata:
        annotations:
          cloud.google.com/load-balancer-type: Internal
      spec:
        type: LoadBalancer
    tls:
      selfSignedCertificate:
        disabled: true

Thanks,
James

Hi @simonjam, thanks for your message. Do you have a validating webhook deployed? You can check that with the following command:

kubectl get validatingwebhookconfigurations elastic-webhook.k8s.elastic.co

It seems there might be a connectivity issue between the API server and the operator (where the webhook is running); you can confirm that by looking at the API server logs. The allowed timeout for the webhook is 30 seconds and, by default, the failure policy is to ignore failures to connect, but only after those 30 seconds have passed. Since the default timeout for the request itself is also 30 seconds, that doesn't leave enough time for the admission process to complete. If that is the case, you can either:

  • investigate the connectivity issues (maybe some network policy is blocking that connection), or
  • decrease the default timeout on the webhook, or
  • increase the timeout on the client (as you did), or
  • remove the webhook (see the sketch after this list).
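For the last option, a minimal sketch, assuming the default webhook configuration name from the ECK install manifests; the first command only inspects the current timeout and failure policy:

# Show the configured timeout and failure policy of the ECK validating webhook
kubectl get validatingwebhookconfigurations elastic-webhook.k8s.elastic.co \
  -o jsonpath='{range .webhooks[*]}{.name}{" timeoutSeconds="}{.timeoutSeconds}{" failurePolicy="}{.failurePolicy}{"\n"}{end}'

# Bluntest workaround: remove the webhook entirely (Elasticsearch resources are then no longer validated on admission)
kubectl delete validatingwebhookconfigurations elastic-webhook.k8s.elastic.co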

Edit:

It seems I skimmed too quickly through the second part of your post. If the root cause is the same in both cases you are reporting, then the webhook is not that root cause. Is the Elasticsearch cluster green when this is happening? Can you exec into the operator pod and curl against Elasticsearch?

Thanks for your reply. The cluster seems to be fine during all of this. The status is green. Checking /_cluster/settings on the cluster shows that the minimum master nodes change was successful. It's just that the annotation couldn't be updated on the K8s resource.
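(For reference, roughly the check I ran; <es-endpoint> is a placeholder for the service address, and no credentials are needed here because of the anonymous superuser config above:)

curl -s 'http://<es-endpoint>:9200/_cluster/settings?flat_settings=true' | grep minimum_master_nodes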

The time between the Updating minimum master nodes log line and the subsequent PUT is minimal, indicating SetMinimumMasterNodes() completed quickly too, so there doesn't seem to be a communication problem between the operator and the cluster.

Another data point: updating other K8s resources (i.e. not the elasticsearch resource) works fine. For example, the http service updated in 8ms:

E 2020-02-12T18:45:50.422888724Z Updating resource 
I 2020-02-12T18:45:50.423003Z curl -k -v -XPUT  -H "Accept: application/json, */*" -H "Content-Type: application/json" -H "User-Agent: elastic-operator/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <trimmed>" 'https://10.0.16.1:443/api/v1/namespaces/default/services/elasticsearch-eck-es-http' 
I 2020-02-12T18:45:50.431823Z PUT https://10.0.16.1:443/api/v1/namespaces/default/services/elasticsearch-eck-es-http 200 OK in 8 milliseconds 

So it's really only updating the custom Elasticsearch resource that takes 30.1s. I don't know enough about K8s to debug what happens when a custom resource is mutated. Any pointers? Is it possible that this triggers something with the webhook?

Thanks again,
James

It's not clear to me that we can rule out the webhook causing issues. The API server logs can sometimes help diagnose what is happening in that case; it may be worth looking at them if you have access to them. On GKE I think the quickest way is:

kubectl proxy
curl http://localhost:8001/logs/kube-apiserver.log

Another option is to have the webhook fail faster, which will at least give you a quicker feedback loop. If you edit the timeoutSeconds of the validatingwebhookconfiguration named elastic-webhook.k8s.elastic.co down to 10s, it should still have enough time to run but will error out sooner.
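For example, something along these lines should work (a sketch, assuming the webhook entry is the first and only item in the webhooks list):

kubectl patch validatingwebhookconfiguration elastic-webhook.k8s.elastic.co \
  --type=json \
  -p='[{"op": "replace", "path": "/webhooks/0/timeoutSeconds", "value": 10}]'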

Bingo! It's definitely the webhook timing out:

W0214 00:24:29.162595       1 dispatcher.go:68] Failed calling webhook, failing open elastic-es-validation-v1.k8s.elastic.co: failed calling webhook "elastic-es-validation-v1.k8s.elastic.co": Post https://elastic-webhook-server.elastic-system.svc:443/validate-elasticsearch-k8s-elastic-co-v1-elasticsearch?timeout=10s: context deadline exceeded
E0214 00:24:29.162676       1 dispatcher.go:69] failed calling webhook "elastic-es-validation-v1.k8s.elastic.co": Post https://elastic-webhook-server.elastic-system.svc:443/validate-elasticsearch-k8s-elastic-co-v1-elasticsearch?timeout=10s: context deadline exceeded
I0214 00:24:29.162747       1 controller.go:606] quota admission added evaluator for: elasticsearches.elasticsearch.k8s.elastic.co
I0214 00:24:29.168670       1 trace.go:81] Trace[549461993]: "GuaranteedUpdate etcd3: *unstructured.Unstructured" (started: 2020-02-14 00:24:19.157982255 +0000 UTC m=+3370.129443196) (total time: 10.010648204s):
Trace[549461993]: [10.005073097s] [10.004863539s] Transaction prepared
I0214 00:24:29.169161       1 trace.go:81] Trace[655951107]: "Patch /apis/elasticsearch.k8s.elastic.co/v1/namespaces/default/elasticsearches/elasticsearch-eck" (started: 2020-02-14 00:24:19.157768533 +0000 UTC m=+3370.129229421) (total time: 10.011356215s):
Trace[655951107]: [10.010946995s] [10.008463021s] Object stored in database
I0214 00:24:29.170010       1 wrap.go:47] PATCH /apis/elasticsearch.k8s.elastic.co/v1/namespaces/default/elasticsearches/elasticsearch-eck: (10.04001539s) 200 [kubectl/v1.14.10 (darwin/amd64) kubernetes/f5757a1 12.169.99.90:11055]

I logged into the operator pod and checked connectivity to the webhook. It seems fine:

> kubectl exec -it elastic-operator-0 -n elastic-system sh
sh-4.2$ curl -k 'https://elastic-webhook-server.elastic-system.svc:443/validate-elasticsearch-k8s-elastic-co-v1-elasticsearch?timeout=10s'
{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}

And I do see some signs of the webhook running validation in that window.

I 2020-02-14T00:23:58.856727Z Response Body: {"kind":"Secret","apiVersion":"v1","metadata":{"name":"elastic-webhook-server-cert","namespace":"elastic-system","selfLink":"/api/v1/namespaces/elastic-system/secrets/elastic-webhook-server-cert","uid":"2d7a146b-3e1a-11ea-a147-42010a020114","resourceVersion":"363426","creationTimestamp":"2020-01-23T19:54:34Z","annotations":{"kubectl.kubernetes.io/last-applied-configuration":"{\"apiVersion\":\"v1\",\"kind\":\"Secret\",\"metadata\":{\"annotations\":{},\"name\":\"elastic-webhook-server-cert\",\"namespace\":\"elastic-system\"}}\n"}},"data":{"tls.crt":"<trimmed>"},"type":"Opaque"} 
E 2020-02-14T00:23:58.857225971Z Ending reconciliation run 
E 2020-02-14T00:24:05.719645818Z Retrieving cluster state 
E 2020-02-14T00:24:15.719642037Z Retrieving cluster state 
E 2020-02-14T00:24:22.224665586Z Collecting 
E 2020-02-14T00:24:22.224969500Z Saving 
I 2020-02-14T00:24:22.225065Z Request Body: {"kind":"ConfigMap","apiVersion":"v1","metadata":{"name":"elastic-licensing","namespace":"elastic-system","creationTimestamp":null,"labels":{"common.k8s.elastic.co/type":"elastic-usage"}},"data":{"eck_license_level":"basic","enterprise_resource_units":"3","timestamp":"2020-02-14T00:24:22Z","total_managed_memory":"141.73GB"}} 
I 2020-02-14T00:24:22.225311Z curl -k -v -XPUT  -H "Content-Type: application/json" -H "User-Agent: elastic-operator/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <trimmed>" -H "Accept: application/json, */*" 'https://10.0.16.1:443/api/v1/namespaces/elastic-system/configmaps/elastic-licensing' 
I 2020-02-14T00:24:22.231163Z PUT https://10.0.16.1:443/api/v1/namespaces/elastic-system/configmaps/elastic-licensing 200 OK in 5 milliseconds 
I 2020-02-14T00:24:22.231194Z Response Headers: 
I 2020-02-14T00:24:22.231201Z     Audit-Id: 5ebfff5b-d919-40fc-8b07-c8b417228822 
I 2020-02-14T00:24:22.231206Z     Content-Type: application/json 
I 2020-02-14T00:24:22.231212Z     Content-Length: 494 
I 2020-02-14T00:24:22.231216Z     Date: Fri, 14 Feb 2020 00:24:22 GMT 
I 2020-02-14T00:24:22.231249Z Response Body: {"kind":"ConfigMap","apiVersion":"v1","metadata":{"name":"elastic-licensing","namespace":"elastic-system","selfLink":"/api/v1/namespaces/elastic-system/configmaps/elastic-licensing","uid":"3a9a56fe-3e1a-11ea-a147-42010a020114","resourceVersion":"6328158","creationTimestamp":"2020-01-23T19:54:56Z","labels":{"common.k8s.elastic.co/type":"elastic-usage"}},"data":{"eck_license_level":"basic","enterprise_resource_units":"3","timestamp":"2020-02-14T00:24:22Z","total_managed_memory":"141.73GB"}} 
E 2020-02-14T00:24:25.719660086Z Retrieving cluster state 
E 2020-02-14T00:24:29.133470268Z Starting reconciliation run 
E 2020-02-14T00:24:29.133541440Z Starting reconciliation run 
E 2020-02-14T00:24:29.133775052Z validate create 

There doesn't seem to be much more logging in the webhook, so it's not clear whether it's just slow or is silently failing in there. I guess I can add some logging and see what it's doing, but if you've got other suggestions let me know.

In the meantime, lowering the timeout is enough of a workaround.

Thanks,
James

It's surprising to me that the request is getting to the webhook (the validate create message in the operator logs) but the API server is not getting the response back. We are pretty light on logging around this functionality, but the validation logic is also not too complicated and doesn't make external calls, so I would be surprised if it's silently failing inside the operator.

Are you maybe running a private GKE cluster? If so, the GKE docs describe how to add firewall rules, which may be necessary:
https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters#add_firewall_rules
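
If that's the case, a firewall rule along these lines should open the webhook port (a sketch only: the rule name, network, node tag and master CIDR are placeholders for your cluster's values, and 9443 is the port the ECK webhook server listens on):

# Allow the GKE master to reach the ECK webhook server on the nodes
gcloud compute firewall-rules create allow-apiserver-to-eck-webhook \
  --network <cluster-network> \
  --source-ranges <master-ipv4-cidr> \
  --target-tags <node-tag> \
  --allow tcp:9443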

See also this upstream Kubernetes issue: https://github.com/kubernetes/kubernetes/issues/79739

For what it's worth, I cannot reproduce it using this GKE configuration (i.e. the webhook works as expected):

You're spot on, Anya. That's exactly the issue: port 9443 needs to be opened up when using private GKE clusters. I definitely agree with the comment on that GitHub issue about how obtuse this is.

Thanks a ton for your help!

James