ES 7.3.0 cluster crashes with error "failed to flush export bulks" after restarting the data pod

Hello Team,

I am running into issues with my Elasticsearch v7.3.0 cluster deployed on AWS EKS.
It is a three-node cluster running three pods: one master (ES Master-1), one data (ES Data-1), and one ingest (ES Ingest-1) node.

The PersistentVolume and PersistentVolumeClaim have been configured as follows:

PersistentVolume:

Name:              pvc-be7975f2-ce35-4e40-9c91-98e1d948362b
Labels:            failure-domain.beta.kubernetes.io/region=us-west-2
                   failure-domain.beta.kubernetes.io/zone=us-west-2a
Annotations:       kubernetes.io/createdby: aws-ebs-dynamic-provisioner
                   pv.kubernetes.io/bound-by-controller: yes
                   pv.kubernetes.io/provisioned-by: kubernetes.io/aws-ebs
Finalizers:        [kubernetes.io/pv-protection]
StorageClass:      gp2
Status:            Bound
Claim:             accurics-lmm/ebs-gp2-storage-elasticsearch-data-0
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          30Gi
Node Affinity:     
  Required Terms:  
    Term 0:        failure-domain.beta.kubernetes.io/zone in [us-west-2a]
                   failure-domain.beta.kubernetes.io/region in [us-west-2]
Message:           
Source:
    Type:       AWSElasticBlockStore (a Persistent Disk resource in AWS)
    VolumeID:   aws://us-west-2a/vol-08bdc3XXXXXXXXXX
    FSType:     ext4
    Partition:  0
    ReadOnly:   false

Below are the configurations for the data pod:
elasticsearch-data-configmap.yaml

---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: accurics-lmm
  name: elasticsearch-data-config
  labels:
    app: elasticsearch
    role: data
data:
  elasticsearch.yml: |-
    cluster.name: ${CLUSTER_NAME}
    node.name: ${NODE_NAME}
    discovery.seed_hosts: ${NODE_LIST}
    cluster.initial_master_nodes: ${MASTER_NODES}
    network.host: 0.0.0.0
    node:
      master: false
      data: true
      ingest: false
    xpack.security.enabled: true
    xpack.monitoring.collection.enabled: true
    path.data: /usr/share/elasticsearch/data
---

elasticsearch-data-statefulset.yaml

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: accurics-lmm
  name: elasticsearch-data
  labels:
    app: elasticsearch
    role: data
spec:
  serviceName: "elasticsearch-data"
  replicas: 2
  selector:
    matchLabels:
      app: elasticsearch-data
  template:
    metadata:
      labels:
        app: elasticsearch-data
        role: data
    spec:
      containers:
      - name: elasticsearch-data
        image: docker.elastic.co/elasticsearch/elasticsearch:7.3.0
        env:
        - name: CLUSTER_NAME
          value: elasticsearch
        - name: NODE_NAME
          value: elasticsearch-data
        - name: NODE_LIST
          value: elasticsearch-master,elasticsearch-data,elasticsearch-client
        - name: MASTER_NODES
          value: elasticsearch-master
        - name: "ES_JAVA_OPTS"
          value: "-Xms300m -Xmx300m"
        ports:
        - containerPort: 9300
          name: transport
        volumeMounts:
        - name: config
          mountPath: /usr/share/elasticsearch/config/elasticsearch.yml
          readOnly: true
          subPath: elasticsearch.yml
      volumes:
      - name: config
        configMap:
          name: elasticsearch-data-config
      initContainers:
      - name: increase-vm-max-map
        image: busybox:1.28
        command: ["sh", "-c", "sysctl -w vm.max_map_count=262144"]
        securityContext:
          privileged: true
      - name: resolve-permission
        image: busybox:1.28
        command: ["sh", "-c", "chown -R 1000:1000 /usr/share/elasticsearch/data"]
        securityContext:
          privileged: true
        volumeMounts:
        - name: ebs-gp2-storage
          mountPath: /usr/share/elasticsearch/data
      - name: increase-fd-ulimit
        image: busybox:1.28
        command: ["sh", "-c", "ulimit -n 65536"]
        securityContext:
          privileged: true
  volumeClaimTemplates:
  - metadata:
      name: ebs-gp2-storage
      annotations:
        volume.beta.kubernetes.io/storage-class: "gp2"
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: gp2
      resources:
        requests:
          storage: 30Gi
---

Creating the cluster for the very first time works fine, and the cluster state goes Green. It allows me to generate passwords and log in through Kibana.

Issue:
After deleting the data pod or rolling out updates, the cluster starts crashing with multiple errors.

{"type": "server", "timestamp": "2020-03-07T11:59:37,754+0000", "level": "WARN", "component": "o.e.x.m.MonitoringService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-data", "cluster.uuid": "Mi1MiJAKSVy1HCxEpBSHeg", "node.id": "GoW0wiLDQr6YDQfXm_qMpA", "message": "monitoring execution failed" ,
"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks",

"Caused by: org.elasticsearch.action.UnavailableShardsException: [.monitoring-es-7-2020.03.07][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2020.03.07][0]] containing [index {[.monitoring-es-7-2020.03.07][_doc][pVLctHABfG6veKc3jHmy]

After a few minutes, Kibana starts failing with the error below:

{"type":"log","@timestamp":"2020-03-07T12:18:17Z","tags":["error","task_manager"],"pid":1,"message":"Failed to poll for work: [security_exception] failed to authenticate user [kibana], with { header={ WWW-Authenticate="Basic realm=\"security\" charset=\"UTF-8\"" } }

Please let me know if you need any further details. Any help would be appreciated.
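For reference, this is roughly how I check the cluster state after the pod restarts. This is only a sketch: the namespace is mine, but the pod name and the `ELASTIC_PASSWORD` variable are assumptions that may differ in your setup.

```shell
NS=accurics-lmm
POD=elasticsearch-master-0          # assumed master pod name
# Guarded so the commands only run where kubectl is available.
if command -v kubectl >/dev/null 2>&1; then
  # Overall cluster health (status, number of unassigned shards)
  kubectl -n "$NS" exec "$POD" -- \
    curl -s -u "elastic:$ELASTIC_PASSWORD" "localhost:9200/_cluster/health?pretty"
  # Per-shard state, including UNASSIGNED shards such as .monitoring-es-7-*
  kubectl -n "$NS" exec "$POD" -- \
    curl -s -u "elastic:$ELASTIC_PASSWORD" "localhost:9200/_cat/shards?v"
else
  echo "kubectl not found; run this inside the EKS environment"
fi
```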

This is not a crash. You seem to be shutting down your only data node? If so, you should expect things to stop working until that node comes back up again.

Thanks for the prompt response, David.
OK, but when the data node comes back up, the cluster health status doesn't return to Green. I first observed this when I ran "kubectl apply -f ...." to upgrade the cluster to 7.6.0: the containers were upgraded but started failing. I then rolled back to 7.3.0 and reproduced the issue by deleting the data node.

I can see the following errors on both the master and data nodes:

{"type": "server", "timestamp": "2020-03-07T12:19:40,485+0000", "level": "WARN", "component": "o.e.x.m.e.l.LocalExporter", "cluster.name": "elasticsearch", "node.name": "elasticsearch-master", "cluster.uuid": "Mi1MiJAKSVy1HCxEpBSHeg", "node.id": "lQkl0mPtTQG-zSsYZS4-8w", "message": "unexpected error while indexing monitoring document" ,

"stacktrace": ["org.elasticsearch.xpack.monitoring.exporter.ExportException: UnavailableShardsException[[.monitoring-es-7-2020.03.07][0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[.monitoring-es-7-2020.03.07][0]] containing [9] requests]]",

"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:125) ~[x-pack-monitoring-7.3.0.jar:7.3.0]",

"at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195) ~[?:?]",

"Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents",

"at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:121) ~[?:?]",

"... 40 more"] }
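To see why the primary shard stays unassigned, I believe the cluster allocation explain API can be queried like this. A sketch only: the pod name and the `ELASTIC_PASSWORD` variable are assumptions.

```shell
NS=accurics-lmm
POD=elasticsearch-master-0          # assumed master pod name
if command -v kubectl >/dev/null 2>&1; then
  # Ask Elasticsearch why this primary shard is not being allocated
  kubectl -n "$NS" exec "$POD" -- \
    curl -s -u "elastic:$ELASTIC_PASSWORD" \
    -H 'Content-Type: application/json' \
    "localhost:9200/_cluster/allocation/explain?pretty" \
    -d '{"index": ".monitoring-es-7-2020.03.07", "shard": 0, "primary": true}'
else
  echo "kubectl not found; run this inside the EKS environment"
fi
```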

You deleted your only data node? If so, these errors seem unsurprising. You need a data node to index documents.
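Once the data node rejoins, you can confirm the shards recover with something like the following. This is a sketch; the pod name and credentials variable are assumptions to adapt to your deployment.

```shell
NS=accurics-lmm
POD=elasticsearch-master-0          # assumed master pod name
if command -v kubectl >/dev/null 2>&1; then
  # Blocks for up to 60s until the cluster reports green
  kubectl -n "$NS" exec "$POD" -- \
    curl -s -u "elastic:$ELASTIC_PASSWORD" \
    "localhost:9200/_cluster/health?wait_for_status=green&timeout=60s&pretty"
else
  echo "kubectl not found; run this inside the EKS environment"
fi
```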

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.