Elasticsearch failing: missing shard .kibana_task_manager_7.15.0_001

Hi,

Currently I am having a lot of trouble keeping the ELK stack alive. It first failed after 11 days. Since there was no important production data, I redeployed everything. Now it has failed with the same errors after 2 days.

I am using Elastic Cloud on Kubernetes 8.0, running in an OpenShift 4.6 cluster with Azure Files for storage. All images are on version 7.15.0 (Elasticsearch/Kibana/Filebeat/Metricbeat).

I am completely new to Elastic, so I really appreciate your help. I tried to extract the relevant parts from the pod logs.

Elasticsearch Pod Log

{"type": "server", "timestamp": "2021-10-07T07:08:42,729Z", "level": "ERROR", "component": "o.e.i.g.DatabaseRegistry", "cluster.name": "Elasticsearch", "node.name": "Elasticsearch-es-elastic-0", "message": "failed to download database [GeoLite2-ASN.mmdb]",
"stacktrace": ["org.Elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];",

[...]

"Caused by: org.Elasticsearch.action.search.SearchPhaseExecutionException: Search rejected due to missing shards [[.kibana_task_manager_7.15.0_001][0]]. Consider using allow_partial_search_results setting to bypass this error.",

[...]

{"type":"retryable_es_client_error","message":"search_phase_execution_exception: ","error":{"name":"ResponseError","meta":{"body":{"error":{"root_cause":,"type":"search_phase_execution_exception","reason":"","phase":"open_search_context","grouped":true,"failed_shards":,"caused_by":{"type":"search_phase_execution_exception","reason":"Search rejected due to missing shards [[.kibana_task_manager_7.15.0_001][0]]. Consider using allow_partial_search_results setting to bypass this error.","phase":"open_search_context","grouped":true,"failed_shards":

Kibana Pod Log

{"type":"retryable_es_client_error","message":"search_phase_execution_exception: ","error":{"name":"ResponseError","meta":{"body":{"error":{"root_cause":,"type":"search_phase_execution_exception","reason":"","phase":"open_search_context","grouped":true,"failed_shards":,"caused_by":{"type":"search_phase_execution_exception","reason":"Search rejected due to missing shards [[.kibana_task_manager_7.15.0_001][0]]. Consider using allow_partial_search_results setting to bypass this error.","phase":"open_search_context","grouped":true,"failed_shards":

{"type":"log","@timestamp":"2021-10-07T07:46:41+00:00","tags":["fatal","root"],"pid":1215,"message":"Error: Unable to complete saved object migrations for the [.kibana_task_manager] index: Unable to complete the OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT step after 15 attempts, terminating.\n at migrationStateActionMachine
FATAL Error: Unable to complete saved object migrations for the [.kibana_task_manager] index: Unable to complete the OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT step after 15 attempts, terminating.

Deployment YAML

apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: filebeat
spec:
  type: filebeat
  version: 7.15.0
  elasticsearchRef:
    name: elasticsearch
  kibanaRef:
    name: kibana
  config:
    output.elasticsearch:
      index: "filebeat-%{[agent.version]}-%{+xxxx.ww}"
    setup.ilm.enabled: "false"
    setup.template.name: "filebeat"
    setup.template.pattern: "filebeat-*"
    filebeat.autodiscover.providers:
    - node: ${NODE_NAME}
      type: kubernetes
      hints.default_config.enabled: "false"
      templates:
      - condition.equals.kubernetes.namespace: aro-crs-dev-01
        config:
        - paths: ["/var/log/containers/*${data.kubernetes.container.id}.log"]
          type: container
          processors:
          - decode_json_fields:
              fields: "message"
              process_array: false
              max_depth: 1
              target: "logMessage"
              overwrite_keys: false
              add_error_key: true
              expand_keys: true
[...]
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: filebeat
        automountServiceAccountToken: true
        terminationGracePeriodSeconds: 30
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true # Allows providing richer host metadata
        containers:
        - name: filebeat
          securityContext:
            runAsUser: 0
            # Required when running on Red Hat OpenShift:
            privileged: true
          volumeMounts:
          - name: varlogcontainers
            mountPath: /var/log/containers
          - name: varlogpods
            mountPath: /var/log/pods
          - name: varlibdockercontainers
            mountPath: /var/lib/docker/containers
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
        volumes:
        - name: varlogcontainers
          hostPath:
            path: /var/log/containers
        - name: varlogpods
          hostPath:
            path: /var/log/pods
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - namespaces
  - pods
  - nodes
  verbs:
  - get
  - watch
  - list
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: elastic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
subjects:
- kind: ServiceAccount
  name: filebeat
  namespace: elastic
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: heartbeat
spec:
  type: heartbeat
  version: 7.15.0
  elasticsearchRef:
    name: elasticsearch
  config:
    heartbeat.monitors:
    - type: tcp
      schedule: '@every 5s'
      hosts: ["elasticsearch-es-http.elastic.svc:9200"]
    - type: tcp
      schedule: '@every 5s'
      hosts: ["kibana-kb-http.elastic.svc:5601"]
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: heartbeat
        securityContext:
          runAsUser: 0
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: heartbeat
  labels:
    k8s-app: heartbeat
rules:
- apiGroups: [""]
  resources:
  - nodes
  - namespaces
  - pods
  - services
  verbs: ["get", "list", "watch"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: heartbeat
  namespace: elastic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: heartbeat
subjects:
- kind: ServiceAccount
  name: heartbeat
  namespace: elastic
roleRef:
  kind: ClusterRole
  name: heartbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: metricbeat
spec:
  type: metricbeat
  version: 7.15.0
  elasticsearchRef:
    name: elasticsearch
  kibanaRef:
    name: kibana
  config:
    metricbeat:
      autodiscover:
        providers:
        - hints:
            default_config: {}
            enabled: "true"
          host: ${NODE_NAME}
          type: kubernetes
      modules:
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crscs-dev-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crsias-dev-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crssps-dev-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crscs-int-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crsias-int-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crssps-int-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: system
        period: 30s
        metricsets:
        - cpu
        - load
        - memory
        - network
        - process
        - process_summary
        process:
          include_top_n:
            by_cpu: 5
            by_memory: 5
        processes:
        - .*
      - module: system
        period: 1m
        metricsets:
        - filesystem
        - fsstat
        processors:
        - drop_event:
            when:
              regexp:
                system:
                  filesystem:
                    mount_point: ^/(sys|cgroup|proc|dev|etc|host|lib)($|/)
      - module: kubernetes
        period: 10s
        host: ${NODE_NAME}
        hosts:
        - https://${NODE_NAME}:10250
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        ssl:
          verification_mode: none
        metricsets:
        - node
        - system
        - pod
        - container
        - volume
    processors:
    - add_cloud_metadata: {}
    - add_host_metadata: {}
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: metricbeat
        automountServiceAccountToken: true # some older Beat versions depend on this setting being present in the k8s context
        containers:
        - args:
          - -e
          - -c
          - /etc/beat.yml
          - -system.hostfs=/hostfs
          name: metricbeat
          volumeMounts:
          - mountPath: /hostfs/sys/fs/cgroup
            name: cgroup
          - mountPath: /var/run/docker.sock
            name: dockersock
          - mountPath: /hostfs/proc
            name: proc
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true # Allows providing richer host metadata
        securityContext:
          runAsUser: 0
        terminationGracePeriodSeconds: 30
        volumes:
        - hostPath:
            path: /sys/fs/cgroup
          name: cgroup
        - hostPath:
            path: /var/run/docker.sock
          name: dockersock
        - hostPath:
            path: /proc
          name: proc
---
# permissions needed for metricbeat
# source: https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-kubernetes.html
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metricbeat
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - namespaces
  - events
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "extensions"
  resources:
  - replicasets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - deployments
  - replicasets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes/stats
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metricbeat
  namespace: elastic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metricbeat
subjects:
- kind: ServiceAccount
  name: metricbeat
  namespace: elastic
roleRef:
  kind: ClusterRole
  name: metricbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 7.15.0
  nodeSets:
  - name: elastic
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 200Gi
        storageClassName: azurefile-premiumstorageclass-prod-01
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
spec:
  version: 7.15.0
  count: 1
  elasticsearchRef:
    name: elasticsearch

Hey @WookWook, thanks for your question.

NFS/SMB are not recommended for Elasticsearch. Furthermore, Azure Files (since it's SMB-based) has a known issue. While a fix exists on the Elasticsearch side, I'd recommend using Azure Disk (Managed Disk) for the elasticsearch-data volume instead.
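
In case it helps, here is a minimal sketch of the same nodeSet pointing at an Azure Disk-backed storage class instead of the Azure Files one. The class name managed-premium is an assumption; substitute whichever Azure Disk (Managed Disk) storage class actually exists in your cluster:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 7.15.0
  nodeSets:
  - name: elastic
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 200Gi
        # Assumption: an Azure Disk based StorageClass (e.g. the built-in managed-premium),
        # used in place of the Azure Files class from the original manifest.
        storageClassName: managed-premium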

Thanks,
David

Thanks for the reply, @dkow. I will redeploy with Azure Disks.

Might this be worth a hint in the documentation here?

I had only read about the benefits, so I didn't expect an issue here.