ES 7.3.0 cluster crashes with error "failed to flush export bulks" after restarting the data pod

Hello Team,

I am running into issues with my Elasticsearch v7.3.0 cluster deployed on AWS EKS.
It is a three-node cluster running three pods: one master (ES Master-1), one data (ES Data-1), and one ingest (ES Ingest-1) node.

The PersistentVolume and PersistentVolumeClaim have been configured as follows:

PersistentVolume:

Name:              pvc-be7975f2-ce35-4e40-9c91-98e1d948362b
Labels:            failure-domain.beta.kubernetes.io/region=us-west-2
                   failure-domain.beta.kubernetes.io/zone=us-west-2a
Annotations:       kubernetes.io/createdby: aws-ebs-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      gp2
Status:            Bound
Claim:             accurics-lmm/ebs-gp2-storage-elasticsearch-data-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          30Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [us-west-2a]
                   failure-domain.beta.kubernetes.io/region in [us-west-2]
Message:           
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-west-2a/vol-08bdc3XXXXXXXXXX
    FSType:     ext4
    Partition:  0
    ReadOnly:   false

Below are the configurations for the data pod:
elasticsearch-data-configmap.yaml

---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: accurics-lmm
  name: elasticsearch-data-config
  labels:
    app: elasticsearch
    role: data
data:
  elasticsearch.yml: |-
    cluster.name: ${CLUSTER_NAME}
    node.name: ${NODE_NAME}
    discovery.seed_hosts: ${NODE_LIST}
    cluster.initial_master_nodes: ${MASTER_NODES}
    network.host: 0.0.0.0
    node:
      master: false
      data: true
      ingest: false
    xpack.security.enabled: true
    xpack.monitoring.collection.enabled: true
    path.data: /usr/share/elasticsearch/data
---

elasticsearch-data-statefulset.yaml

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: accurics-lmm
  name: elasticsearch-data
  labels:
    app: elasticsearch
    role: data
spec:
  serviceName: "elasticsearch-data"
  replicas: 2
  selector:
    matchLabels:
      app: elasticsearch-data
  template:
    metadata:
      labels:
        app: elasticsearch-data
        role: data
    spec:
      containers:
      - name: elasticsearch-data
        image: docker.elastic.co/elasticsearch/elasticsearch:7.3.0
        env:
        - name: CLUSTER_NAME
          value: elasticsearch
        - name: NODE_NAME
          value: elasticsearch-data
        - name: NODE_LIST
          value: elasticsearch-master,elasticsearch-data,elasticsearch-client
        - name: MASTER_NODES
          value: elasticsearch-master
        - name: "ES_JAVA_OPTS"
          value: "-Xms300m -Xmx300m"
        ports:
        - containerPort: 9300
          name: transport
        volumeMounts:
        - name: config
          mountPath: /usr/share/elasticsearch/config/elasticsearch.yml
          readOnly: true
          subPath: elasticsearch.yml
      volumes:
      - name: config
        configMap:
          name: elasticsearch-data-config
      initContainers:
      - name: increase-vm-max-map
        image: busybox:1.28
        command: ["sh", "-c", "sysctl -w vm.max_map_count=262144"]
        securityContext:
          privileged: true
      - name: resolve-permission
        image: busybox:1.28
        command: ["sh", "-c", "chown -R 1000:1000 /usr/share/elasticsearch/data"]
        securityContext:
          privileged: true
        volumeMounts:
        - name: ebs-gp2-storage
          mountPath: /usr/share/elasticsearch/data
      - name: increase-fd-ulimit
        image: busybox:1.28
        command: ["sh", "-c", "ulimit -n 65536"]
        securityContext:
          privileged: true
  volumeClaimTemplates:
  - metadata:
      name: ebs-gp2-storage
      annotations:
        volume.beta.kubernetes.io/storage-class: "gp2"
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gp2
      resources:
        requests:
          storage: 30Gi
---

Creating the cluster for the very first time works fine, and the cluster state goes Green. It allows me to generate passwords and log in through Kibana.

Issue:
After deleting the data pod or rolling out updates, the cluster starts crashing with multiple errors.

{"type": "server", "timestamp": "2020-03-07T11:59:37,754+0000", "level": "WARN", "component": "o.e.x.m.MonitoringService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-data", "cluster.uuid": "Mi1MiJAKSVy1HCxEpBSHeg", "node.id": "GoW0wiLDQr6YDQfXm_qMpA", "message": "monitoring execution failed" ,
"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks",

"Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-es-7-2020.03.07][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2020.03.07][0]] containing [index {[.monitoring-es-7-2020.03.07][_doc][pVLctHABfG6veKc3jHmy]

After a few minutes, Kibana starts failing with the error below:

{"type":"log","@timestamp":"2020-03-07T12:18:17Z","tags":["error","task_manager"],"pid":1,"message":"Failed to poll for work: [security_exception] failed to authenticate user [kibana], with { header={ WWW-Authenticate="Basic realm=\"security\" charset=\"UTF-8\"" } }

Please let me know if you need any further details. Any help would be appreciated.
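For reference, this is roughly how I check the cluster state after the pod restarts. This is only a sketch: the namespace is mine, but the pod name and the `ELASTIC_PASSWORD` variable are assumptions that may differ in your setup.

```shell
NS=accurics-lmm
POD=elasticsearch-master-0          # assumed master pod name
# Guarded so the commands only run where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  # Overall cluster health (status, number of unassigned shards)
  kubectl -n "$NS" exec "$POD" -- \
    curl -s -u "elastic:$ELASTIC_PASSWORD" "localhost:9200/_cluster/health?pretty"
  # Per-shard state, including UNASSIGNED shards such as .monitoring-es-7-*
  kubectl -n "$NS" exec "$POD" -- \
    curl -s -u "elastic:$ELASTIC_PASSWORD" "localhost:9200/_cat/shards?v"
else
  echo "kubectl not found; run this inside the EKS environment"
fi
```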

This is not a crash. You seem to be shutting down your only data node? If so, you should expect things to stop working until that node comes back up again.

Thanks for the prompt response, David.
OK, but when the data node comes back up, the cluster health status doesn't return to Green. I first observed this when I ran "kubectl apply -f ...." to upgrade the cluster to 7.6.0: the containers were upgraded but started failing. I then rolled back to 7.3.0 and reproduced the issue by deleting the data node.

I can see the following errors on both the master and data nodes:

{"type": "server", "timestamp": "2020-03-07T12:19:40,485+0000", "level": "WARN", "component": "o.e.x.m.e.l.LocalExporter", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master", "cluster.uuid": "Mi1MiJAKSVy1HCxEpBSHeg", "node.id": "lQkl0mPtTQG-zSsYZS4-8w", "message": "unexpected error while indexing monitoring document" ,

"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-7-2020.03.07][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2020.03.07][0]] containing [9] requests]]",

"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:125) ~[x-pack-monitoring-7.3.0.jar:7.3.0]",

"at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]",

"Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents",

"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:121) ~[?:?]",

"... 40 more"] }
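To see why the primary shard stays unassigned, I believe the cluster allocation explain API can be queried like this. A sketch only: the pod name and the `ELASTIC_PASSWORD` variable are assumptions.

```shell
NS=accurics-lmm
POD=elasticsearch-master-0          # assumed master pod name
if command -v kubectl >/dev/null 2>&1; then
  # Ask Elasticsearch why this primary shard is not being allocated
  kubectl -n "$NS" exec "$POD" -- \
    curl -s -u "elastic:$ELASTIC_PASSWORD" \
    -H 'Content-Type: application/json' \
    "localhost:9200/_cluster/allocation/explain?pretty" \
    -d '{"index": ".monitoring-es-7-2020.03.07", "shard": 0, "primary": true}'
else
  echo "kubectl not found; run this inside the EKS environment"
fi
```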

You deleted your only data node? If so, these errors seem unsurprising. You need a data node to index documents.
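Once the data node rejoins, you can confirm the shards recover with something like the following. This is a sketch; the pod name and credentials variable are assumptions to adapt to your deployment.

```shell
NS=accurics-lmm
POD=elasticsearch-master-0          # assumed master pod name
if command -v kubectl >/dev/null 2>&1; then
  # Blocks for up to 60s until the cluster reports green
  kubectl -n "$NS" exec "$POD" -- \
    curl -s -u "elastic:$ELASTIC_PASSWORD" \
    "localhost:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty"
else
  echo "kubectl not found; run this inside the EKS environment"
fi
```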

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.