Hi,
I'm using ECK 0.8.0 to manage my Elasticsearch cluster in Kubernetes. It has been running for more than half a year and worked fine. However, when I recently tried to scale up my data nodes (from 5 to 8), I observed some unexpected behaviour.
Here's my original setup:
apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
kind: Elasticsearch
metadata:
  name: async-search
  namespace: elasticsearch
spec:
  version: 7.1.0
  nodes:
  - nodeCount: 3
    config:
      node.master: true
      node.data: false
      node.ingest: false
    # some pod template settings
    # ...
    volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 50Gi
        storageClassName: local
        selector:
          matchLabels:
            master: "true"
  - nodeCount: 5
    config:
      node.master: false
      node.data: true
      node.ingest: false
    # rest same as master
  - nodeCount: 2
    config:
      node.master: false
      node.data: false
      node.ingest: false
    # these are the coordinator nodes
    # settings similar to above
  updateStrategy:
    changeBudget:
      maxSurge: 0
      maxUnavailable: 1
So basically I have 3 master nodes, 5 data nodes, and 2 coordinator nodes.
Here's what happened when I tried to add more nodes:
- I tried to add 3 more coordinator nodes (the nodeCount changes for both attempts are sketched after this list).
  expected: ECK adds 3 more coordinator nodes to the cluster.
  observed: ECK first added 1 coordinator node and 2 data nodes, and because I hadn't provisioned any PVs for the data nodes, those 2 pods stayed Pending.
- I tried to add 3 more data nodes and 1 more coordinator node:
  expected: ECK adds 3 more data nodes, waits for data migration to complete, then terminates one of the old nodes and starts another one.
  observed: at some point the coordinator node was added correctly; however, the process of deleting and adding data nodes never stops. I checked the operator log, and it keeps printing:
  {"level":"info","ts":1585191461.1530027,"logger":"driver","msg":"Calculated all required changes","to_create:":8,"to_keep:":6,"to_delete:":8}
  This behaviour went on for about 24 hours, and I thought it was never going to stop, so I changed maxUnavailable to 0 (see the sketch after this list). Because of this it stopped deleting old nodes; however, it is still trying to create a new node.
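For reference, here is roughly what I changed in the spec. This is only a sketch of the nodeCount fields I touched; everything else stayed exactly as in the manifest above. In the first attempt I only bumped the coordinator group from 2 to 5; in the second attempt the data group went to 8 and the coordinator group to 3:

  nodes:
  - nodeCount: 3   # master nodes, unchanged
    config:
      node.master: true
      node.data: false
      node.ingest: false
  - nodeCount: 8   # data nodes, was 5 (second attempt)
    config:
      node.master: false
      node.data: true
      node.ingest: false
  - nodeCount: 3   # coordinator nodes, was 2 (I had tried 5 in the first attempt)
    config:
      node.master: false
      node.data: false
      node.ingest: false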
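And this is the updateStrategy after I set maxUnavailable to 0 (again just a sketch, only that one field changed from the original manifest):

  updateStrategy:
    changeBudget:
      maxSurge: 0
      maxUnavailable: 0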
I don't know why this happens. Can anyone help me with this? BTW, I can't upgrade to 1.0 for now because this environment is heavily used in production, so migrating to 1.0 is not easy right now.
Thanks