Elasticsearch failing. Missing Shard. .kibana_task_manager_7.15.0_001

Hi,

currently I am having a lot of trouble keeping the ELK stack alive. It first failed after 11 days. Since there was no important production data, I redeployed everything. Now it has failed with the same errors after 2 days.

I am using Elastic Cloud on Kubernetes 8.0, running in an OpenShift 4.6 cluster with Azure Files as storage. The images used are version 7.15.0 (Elasticsearch/Kibana/Filebeat/Metricbeat).

I am completely new to Elastic, so I really appreciate your help. I tried to extract the necessary parts from the pod logs.

Elasticsearch Pod Log

{"type": "server", "timestamp": "2021-10-07T07:08:42,729Z", "level": "ERROR", "component": "o.e.i.g.DatabaseRegistry", "cluster.name": "elasticsearch", "node.name": "elasticsearch-es-elastic-0", "message": "failed to download database [GeoLite2-ASN.mmdb]",
"stacktrace": ["org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];",

[...]

"Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException: Search rejected due to missing shards [[.kibana_task_manager_7.15.0_001][0]]. Consider using allow_partial_search_results setting to bypass this error.",

[...]

{"type":"retryable_es_client_error","message":"search_phase_execution_exception: ","error":{"name":"ResponseError","meta":{"body":{"error":{"root_cause":,"type":"search_phase_execution_exception","reason":"","phase":"open_search_context","grouped":true,"failed_shards":,"caused_by":{"type":"search_phase_execution_exception","reason":"Search rejected due to missing shards [[.kibana_task_manager_7.15.0_001][0]]. Consider using allow_partial_search_results setting to bypass this error.","phase":"open_search_context","grouped":true,"failed_shards":

Kibana Pod Log

{"type":"retryable_es_client_error","message":"search_phase_execution_exception: ","error":{"name":"ResponseError","meta":{"body":{"error":{"root_cause":,"type":"search_phase_execution_exception","reason":"","phase":"open_search_context","grouped":true,"failed_shards":,"caused_by":{"type":"search_phase_execution_exception","reason":"Search rejected due to missing shards [[.kibana_task_manager_7.15.0_001][0]]. Consider using allow_partial_search_results setting to bypass this error.","phase":"open_search_context","grouped":true,"failed_shards":

{"type":"log","@timestamp":"2021-10-07T07:46:41+00:00","tags":["fatal","root"],"pid":1215,"message":"Error: Unable to complete saved object migrations for the [.kibana_task_manager] index: Unable to complete the OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT step after 15 attempts, terminating.\n at migrationStateActionMachine
FATAL Error: Unable to complete saved object migrations for the [.kibana_task_manager] index: Unable to complete the OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT step after 15 attempts, terminating.

Deployment YAML

apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: filebeat
spec:
  type: filebeat
  version: 7.15.0
  elasticsearchRef:
    name: elasticsearch
  kibanaRef:
    name: kibana
  config:
    output.elasticsearch:
      index: "filebeat-%{[agent.version]}-%{+xxxx.ww}"
    setup.ilm.enabled: "false"
    setup.template.name: "filebeat"
    setup.template.pattern: "filebeat-*"
    filebeat.autodiscover.providers:
    - node: ${NODE_NAME}
      type: kubernetes
      hints.default_config.enabled: "false"
      templates:
      - condition.equals.kubernetes.namespace: aro-crs-dev-01
        config:
        - paths: ["/var/log/containers/*${data.kubernetes.container.id}.log"]
          type: container
          processors:
          - decode_json_fields:
              fields: "message"
              process_array: false
              max_depth: 1
              target: "logMessage"
              overwrite_keys: false
              add_error_key: true
              expand_keys: true
[...]
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: filebeat
        automountServiceAccountToken: true
        terminationGracePeriodSeconds: 30
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true # Allows to provide richer host metadata
        containers:
        - name: filebeat
          securityContext:
            runAsUser: 0
            # If using Red Hat OpenShift uncomment this:
            privileged: true
          volumeMounts:
          - name: varlogcontainers
            mountPath: /var/log/containers
          - name: varlogpods
            mountPath: /var/log/pods
          - name: varlibdockercontainers
            mountPath: /var/lib/docker/containers
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
        volumes:
        - name: varlogcontainers
          hostPath:
            path: /var/log/containers
        - name: varlogpods
          hostPath:
            path: /var/log/pods
        - name: varlibdockercontainers
          hostPath:
            path: /var/lib/docker/containers
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: filebeat
rules:
- apiGroups: [""] # "" indicates the core API group
  resources:
  - namespaces
  - pods
  - nodes
  verbs:
  - get
  - watch
  - list
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: filebeat
  namespace: elastic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: filebeat
subjects:
- kind: ServiceAccount
  name: filebeat
  namespace: elastic
roleRef:
  kind: ClusterRole
  name: filebeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: heartbeat
spec:
  type: heartbeat
  version: 7.15.0
  elasticsearchRef:
    name: elasticsearch
  config:
    heartbeat.monitors:
    - type: tcp
      schedule: '@every 5s'
      hosts: ["elasticsearch-es-http.elastic.svc:9200"]
    - type: tcp
      schedule: '@every 5s'
      hosts: ["kibana-kb-http.elastic.svc:5601"]
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: heartbeat
        securityContext:
          runAsUser: 0
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: heartbeat
  labels:
    k8s-app: heartbeat
rules:
- apiGroups: [""]
  resources:
  - nodes
  - namespaces
  - pods
  - services
  verbs: ["get", "list", "watch"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: heartbeat
  namespace: elastic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: heartbeat
subjects:
- kind: ServiceAccount
  name: heartbeat
  namespace: elastic
roleRef:
  kind: ClusterRole
  name: heartbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: beat.k8s.elastic.co/v1beta1
kind: Beat
metadata:
  name: metricbeat
spec:
  type: metricbeat
  version: 7.15.0
  elasticsearchRef:
    name: elasticsearch
  kibanaRef:
    name: kibana
  config:
    metricbeat:
      autodiscover:
        providers:
        - hints:
            default_config: {}
            enabled: "true"
          host: ${NODE_NAME}
          type: kubernetes
      modules:
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crscs-dev-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crsias-dev-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crssps-dev-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crscs-int-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crsias-int-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: prometheus
        period: 10s
        metricsets: ["collector"]
        hosts: ["https://aro-crssps-int-01.ngdalabor.de"]
        metrics_path: /q/metrics
      - module: system
        period: 30s
        metricsets:
        - cpu
        - load
        - memory
        - network
        - process
        - process_summary
        process:
          include_top_n:
            by_cpu: 5
            by_memory: 5
        processes:
        - .*
      - module: system
        period: 1m
        metricsets:
        - filesystem
        - fsstat
        processors:
        - drop_event:
            when:
              regexp:
                system:
                  filesystem:
                    mount_point: ^/(sys|cgroup|proc|dev|etc|host|lib)($|/)
      - module: kubernetes
        period: 10s
        host: ${NODE_NAME}
        hosts:
        - https://${NODE_NAME}:10250
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        ssl:
          verification_mode: none
        metricsets:
        - node
        - system
        - pod
        - container
        - volume
    processors:
    - add_cloud_metadata: {}
    - add_host_metadata: {}
  daemonSet:
    podTemplate:
      spec:
        serviceAccountName: metricbeat
        automountServiceAccountToken: true # some older Beat versions are depending on this settings presence in k8s context
        containers:
        - args:
          - -e
          - -c
          - /etc/beat.yml
          - -system.hostfs=/hostfs
          name: metricbeat
          volumeMounts:
          - mountPath: /hostfs/sys/fs/cgroup
            name: cgroup
          - mountPath: /var/run/docker.sock
            name: dockersock
          - mountPath: /hostfs/proc
            name: proc
          env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                fieldPath: spec.nodeName
        dnsPolicy: ClusterFirstWithHostNet
        hostNetwork: true # Allows to provide richer host metadata
        securityContext:
          runAsUser: 0
        terminationGracePeriodSeconds: 30
        volumes:
        - hostPath:
            path: /sys/fs/cgroup
          name: cgroup
        - hostPath:
            path: /var/run/docker.sock
          name: dockersock
        - hostPath:
            path: /proc
          name: proc
---
# permissions needed for metricbeat
# source: https://www.elastic.co/guide/en/beats/metricbeat/current/metricbeat-module-kubernetes.html
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: metricbeat
rules:
- apiGroups:
  - ""
  resources:
  - nodes
  - namespaces
  - events
  - pods
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - "extensions"
  resources:
  - replicasets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - apps
  resources:
  - statefulsets
  - deployments
  - replicasets
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - ""
  resources:
  - nodes/stats
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metricbeat
  namespace: elastic
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: metricbeat
subjects:
- kind: ServiceAccount
  name: metricbeat
  namespace: elastic
roleRef:
  kind: ClusterRole
  name: metricbeat
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 7.15.0
  nodeSets:
  - name: elastic
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 200Gi
        storageClassName: azurefile-premiumstorageclass-prod-01
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
spec:
  version: 7.15.0
  count: 1
  elasticsearchRef:
    name: elasticsearch

Hey @WookWook, thanks for your question.

NFS/SMB are not recommended for Elasticsearch. Furthermore, Azure Files (since it's SMB-based) has a known issue. While a fix exists on the ES side, I'd recommend using Azure Disk (Managed Disk) for the elasticsearch-data volume instead.
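
A minimal sketch of what that could look like: the StorageClass name azure-managed-premium and its parameters are assumptions (any StorageClass backed by Azure Managed Disks, for example a pre-existing managed-premium class, works the same way); only storageClassName changes in the Elasticsearch spec.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azure-managed-premium   # hypothetical name; use any managed-disk backed class
provisioner: kubernetes.io/azure-disk
parameters:
  skuName: Premium_LRS          # premium SSD managed disks
  kind: Managed
reclaimPolicy: Retain
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 7.15.0
  nodeSets:
  - name: elastic
    count: 3
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 200Gi
        storageClassName: azure-managed-premium   # was azurefile-premiumstorageclass-prod-01

Note that the volume claim templates of an existing nodeSet typically cannot be changed in place; the usual path is to create a nodeSet under a new name (or, as in your case with no production data, simply redeploy).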

Thanks,
David


Thanks for the reply @dkow. I will redeploy with Azure Disks.

Might this be worth a hint in the documentation here?

I had only read about the benefits, so I didn't expect an issue here.

Hello @dkow,

It seems to be way more stable with Azure Disks now.

Thank you very much!
