Hi,
I'm using ECK on GKE. Almost every time I make a modification (initial apply, edit) to the Elasticsearch object, the request times out. But only barely! If I increase the timeout, the command usually succeeds in about 30.1 seconds. For example, here's the output of kubectl edit es elasticsearch-eck --request-timeout=60s:
I0212 15:26:57.363492 68694 round_trippers.go:438] PUT https://<master>/apis/elasticsearch.k8s.elastic.co/v1/namespaces/default/elasticsearches/elasticsearch-eck?timeout=1m0s 200 OK in 30093 milliseconds
This isn't so bad locally, since I can just pass a longer timeout to kubectl (see the snippet below). However, it also affects the elastic-operator, which runs into the same timeouts.
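For completeness, the client-side workaround is nothing more than bumping kubectl's request timeout on whatever command touches the object, e.g. for the initial apply (the manifest filename here is only illustrative):

kubectl apply -f elasticsearch-eck.yaml --request-timeout=90s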
For example, here I'm trying to reduce the number of nodes. The edit itself is trivial (sketched below), but the operator is stuck failing to update the minimum_master_nodes annotation.
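The change is just lowering the nodeSet count, roughly like this (the exact target count is only an example):

  nodeSets:
  - name: default
    count: 2  # down from 3

While that change is pending, the operator logs look like this: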
...
E 2020-02-12T18:45:51.190380453Z Updating minimum master nodes
I 2020-02-12T18:45:51.203436Z Request Body: <trimmed>
I 2020-02-12T18:45:51.203671Z curl -k -v -XPUT -H "Accept: application/json, */*" -H "Content-Type: application/json" -H "User-Agent: elastic-operator/v0.0.0 (linux/amd64) kubernetes/$Format" -H "Authorization: Bearer <trimmed>" 'https://10.0.16.1:443/apis/elasticsearch.k8s.elastic.co/v1/namespaces/default/elasticsearches/elasticsearch-eck'
E 2020-02-12T18:45:51.505008214Z Retrieving cluster state
E 2020-02-12T18:46:01.504865401Z Retrieving cluster state
E 2020-02-12T18:46:11.505025417Z Retrieving cluster state
I 2020-02-12T18:46:21.206976Z PUT https://10.0.16.1:443/apis/elasticsearch.k8s.elastic.co/v1/namespaces/default/elasticsearches/elasticsearch-eck 504 Gateway Timeout in 30003 milliseconds
I 2020-02-12T18:46:21.207016Z Response Headers:
I 2020-02-12T18:46:21.207022Z Audit-Id: a9ab3f93-c36c-4c28-a224-f97a03001822
I 2020-02-12T18:46:21.207027Z Content-Type: application/json
I 2020-02-12T18:46:21.207030Z Content-Length: 187
I 2020-02-12T18:46:21.207034Z Date: Wed, 12 Feb 2020 18:46:21 GMT
I 2020-02-12T18:46:21.207082Z Response Body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Timeout: request did not complete within requested timeout 30s","reason":"Timeout","details":{},"code":504}
E 2020-02-12T18:46:21.207784068Z Ending reconciliation run
E 2020-02-12T18:46:21.207793754Z Reconciler error
...
This repeats for many hours and only occasionally, magically, gets through and makes progress.
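In case it's useful for reproducing: the operator log lines above can be followed live; with the default all-in-one install that's something like:

kubectl logs -n elastic-system elastic-operator-0 -f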
Is there a way to bump this timeout? Or is there something else to look into regarding why these operations seem to take exactly the wrong amount of time?
I don't think there's anything terribly fancy in the config:
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-eck
spec:
  version: 6.8.6
  nodeSets:
  - name: default
    count: 3
    config:
      node.master: true
      node.data: true
      node.ingest: true
      processors: 8
      reindex.remote.whitelist: "*:9200"
      thread_pool.index.queue_size: 500
      thread_pool.write.queue_size: 500
      xpack.security.authc:
        anonymous:
          username: anonymous
          roles: superuser
          authz_exception: false
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            requests:
              memory: 44Gi
            limits:
              memory: 44Gi
          env:
          - name: ES_JAVA_OPTS
            value: -Xmx12g -Xms12g -XX:-UseParallelGC -XX:-UseConcMarkSweepGC -XX:+UseG1GC
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 100Gi
        storageClassName: standard
  http:
    service:
      metadata:
        annotations:
          cloud.google.com/load-balancer-type: Internal
      spec:
        type: LoadBalancer
    tls:
      selfSignedCertificate:
        disabled: true
Thanks,
James