Kubernetes master pods get into split-brain


(Raj ) #1

We have had this happen a few times with different versions of Elasticsearch 5.x.

Our setup is 5 es-master nodes with minimum_master_nodes set to 3.

The initial logs:
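The usual split-brain guard in 5.x is setting discovery.zen.minimum_master_nodes to a quorum of master-eligible nodes, floor(N / 2) + 1, which for 5 masters does work out to 3. A quick sketch of the arithmetic (nothing here is specific to this cluster):

```shell
# Quorum rule for discovery.zen.minimum_master_nodes in Elasticsearch 5.x:
# floor(master_eligible_nodes / 2) + 1
masters=5
echo $(( masters / 2 + 1 ))
```

With the setting at 3, two partitions of a 5-master cluster should never both be able to elect a master, so a true split-brain may point at the setting not actually being applied rather than at the quorum math.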

[2018-07-04T13:17:33,571][INFO ][o.e.c.s.ClusterService ] [es-master-201806132159-2] removed {{es-client-5f4887dcc6-rtr47}{3C2YzN4DRMa2G3KPzBe1oQ}{Gepr6NcSR5WOfaEpdcPDeA}{100.96.59.14}{100.96.59.14:9300}{ml.max_open_jobs=10, ml.enabled=true},}, reason: zen-disco-receive(from master [master {es-master-201806132159-4}{IXiL8OObRFqicx8ML12R3w}{tVqAd6i0SkmKRs0lkCx_Yw}{100.96.58.14}{100.96.58.14:9300}{ml.max_open_jobs=10, ml.enabled=true} committed version [40209]])

All the pods in the cluster then started to lose their connection to the master.


(Mark Walkom) #2

Please show your config.

Also, why 5 masters?


(Raj ) #3

We increased it to 5 assuming it would help with the split-brain scenario, but it does not appear to have helped.
Below is the YAML used for es-master:

apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  creationTimestamp: 2018-06-14T06:07:05Z
  generation: 1
  labels:
    component: elasticsearch
    role: master
  name: es-master-201806132159
  namespace: default
  resourceVersion: "99059760"
  selfLink: /apis/apps/v1beta1/namespaces/default/statefulsets/es-master-201806132159
  uid: 29bbdca4-6f99-11e8-84c2-06aa968610a2
spec:
  podManagementPolicy: OrderedReady
  replicas: 5
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      component: elasticsearch
      role: master
  serviceName: es-master
  template:
    metadata:
      annotations:
        pod.alpha.kubernetes.io/initialized: "true"
      creationTimestamp: null
      labels:
        component: elasticsearch
        role: master
    spec:
      containers:
      - env:
        - name: AUTO_CREATE_INDEX
          valueFrom:
            configMapKeyRef:
              key: auto_create_index
              name: elasticsearch-config
        - name: AWS_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              key: aws_access_key_id
              name: aws-secrets
        - name: AWS_SECRET_KEY
          valueFrom:
            secretKeyRef:
              key: aws_secret_access_key
              name: aws-secrets
        - name: CLUSTER_NAME
          value: prod
        - name: DISCOVERY_TYPE
          value: kubernetes
        - name: ES_JAVA_OPTS
          valueFrom:
            configMapKeyRef:
              key: es_master_java_opts
              name: elasticsearch-config
        - name: ES_PASSWORD
          valueFrom:
            secretKeyRef:
              key: es_password
              name: es-secrets
        - name: HTTP_ENABLE
          value: "false"
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: NETWORK_HOST
          value: 0.0.0.0
        - name: NODE_DATA
          value: "false"
        - name: NODE_MASTER
          value: "true"
        - name: NUMBER_OF_MASTERS
          value: "3"
        - name: SERVICE
          value: elasticsearch
        - name: XPACK_SECURITY
          valueFrom:
            configMapKeyRef:
              key: xpack_security
              name: elasticsearch-config
        - name: SMART_READINESS_PROBE
          valueFrom:
            configMapKeyRef:
              key: smart_readiness_probe
              name: elasticsearch-config
        image: elasticsearch:2018060601
        imagePullPolicy: Always
        livenessProbe:
          failureThreshold: 3
          initialDelaySeconds: 360
          periodSeconds: 10
          successThreshold: 1
          tcpSocket:
            port: 9300
          timeoutSeconds: 1
        name: es-master-201806132159
        ports:
        - containerPort: 9300
          protocol: TCP
        readinessProbe:
          exec:
            command:
            - /bin/bash
            - -c
            - if [[ "$SMART_READINESS_PROBE" == "true" ]]; then [[ "green" == $(curl -s
              -u elastic:$ES_PASSWORD elasticsearch:9200/_cluster/health | jq -r .status)
              ]]; fi
          failureThreshold: 3
          periodSeconds: 10
          successThreshold: 1
          timeoutSeconds: 1
        resources: {}
        securityContext:
          capabilities:
            add:
            - IPC_LOCK
          privileged: true
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /elasticsearch/data
          name: es-master-pvc
        - mountPath: /usr/share/elasticsearch/config/elasticsearch.yml
          name: elasticsearch-config-volume
          subPath: elasticsearch.yml
      dnsPolicy: ClusterFirst
      nodeSelector:
        dedicated: backend
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: elasticsearch-admin
      serviceAccountName: elasticsearch-admin
      terminationGracePeriodSeconds: 0
      tolerations:
      - effect: NoSchedule
        key: dedicated
        operator: Equal
        value: backend
      volumes:
      - configMap:
          defaultMode: 420
          name: elasticsearch-config
        name: elasticsearch-config-volume
  updateStrategy:
    type: OnDelete
  volumeClaimTemplates:
  - metadata:
      annotations:
        volume.beta.kubernetes.io/storage-class: default
      creationTimestamp: null
      name: es-master-pvc
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 100Gi
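The inline readiness probe command is easier to get right as a standalone script, where the bash quoting is explicit (the inline form `if "$VAR" == "true"; then` would execute the variable's value as a command instead of testing it). A hedged sketch, assuming ES_PASSWORD and SMART_READINESS_PROBE are injected via the pod environment and the `elasticsearch` service name resolves in-cluster:

```shell
#!/bin/bash
# Hypothetical standalone readiness check, equivalent to the inline
# probe command in the manifest above.
ready() {
  # When the smart probe is disabled, always report ready.
  if [[ "${SMART_READINESS_PROBE:-false}" != "true" ]]; then
    return 0
  fi
  # Otherwise the pod is ready only when cluster health is green.
  local status
  status=$(curl -s -u "elastic:${ES_PASSWORD}" \
    "elasticsearch:9200/_cluster/health" | jq -r .status)
  [[ "$status" == "green" ]]
}
```

A script like this could be mounted from the same ConfigMap and invoked from the probe, rather than embedding the conditional in the manifest.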

(Mark Walkom) #4

Can you edit that and use markdown backticks, or the </> button, to format it as code?

Also, can you show your elasticsearch.yml config?


(system) #6

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.