Elasticsearch instability using the ECK operator on an OpenShift cluster

We are having issues with our Elasticsearch pods: they have gone into a Pending state twice in the last four months.

Sometimes the pods restart while Kibana is loading a few dashboards. We have added an ILM policy, but the frozen indices were still open, so we created a cronjob to close them (a sketch is below).
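
A rough sketch of that cronjob, for reference. The schedule, the curl image, and the logs-* index pattern are illustrative; elasticsearch-es-http and elasticsearch-es-elastic-user are the service and secret names ECK generates for a cluster named elasticsearch:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: close-frozen-indices
spec:
  schedule: "0 2 * * *"  # once a day, off-peak (illustrative)
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: close-indices
            image: curlimages/curl:8.5.0  # any image with curl works
            env:
            - name: ES_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: elasticsearch-es-elastic-user  # elastic user secret managed by ECK
                  key: elastic
            command:
            - sh
            - -c
            # close every index matching the (illustrative) logs-* pattern
            - >-
              curl -sk -u "elastic:${ES_PASSWORD}"
              -X POST "https://elasticsearch-es-http:9200/logs-*/_close"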

If anything stands out in the details below, please feel free to share your findings.

thanks a lot,
Pedro

ECK Info:

Our current stack runs the ECK operator with Fluentd (logs), Heartbeat (uptime), and APM, deployed into an OpenShift cluster.

stack versions:

ECK 2.3.0
Elasticsearch: 7.13.3
Kibana: 7.13.3
Fluentd: 1.13.2
APM: 7.13.3

The Elasticsearch resource configuration is:

Storage: 2000 GB (886 GB used)
CPU: 1.7
Mem: 10 GB
Java opts: -Xms8g -Xmx8g
Nodes: 4

The ILM strategy for the indices is as follows (a sketch of the policy appears after the list):

  • For logs:
    • hot: 3 days
    • warm: + 3 days
    • cold: + 7 days (frozen)
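
The policy itself looks roughly like this. The policy name is illustrative, and the min_age values assume the stage lengths above are cumulative (warm starts at day 3, cold at day 6):

PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "3d" }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {}
      },
      "cold": {
        "min_age": "6d",
        "actions": {
          "freeze": {}
        }
      }
    }
  }
}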

Size by stage:

Hot/warm data: 2.7 GB
Frozen & closed data: 883.3 GB

Errors from the logs:

Caused by: org.elasticsearch.cluster.block.ClusterBlockException: blocked by: [SERVICE_UNAVAILABLE/2/no master];

org.elasticsearch.xpack.monitoring.exporter.ExportException: ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/2/no master];]

master not discovered or elected yet

org.elasticsearch.transport.NodeDisconnectedException

All shards failed

{"type": "server", "timestamp": "2022-09-24T23:53:13,151Z", "level": "INFO", "component": "o.e.m.j.JvmGcMonitorService", "cluster.name": "elasticsearch", "node.name": "elasticsearch-es-default-0", "message": "[gc][4343174] overhead, spent [290ms] collecting in the last [1s]", "cluster.uuid": "oY7xZhUySiKbHHfr1t0pgQ", "node.id": "0YExrwJvRia_A6J-2aum4w"  }

Elasticsearch manifest:

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 7.13.3
  volumeClaimDeletePolicy: DeleteOnScaledownOnly
  nodeSets:
  - name: default
    config:
      node.roles: ["master", "data", "ingest", "ml"]
      path.repo: ["/elastic-snapshot"]
    podTemplate:
      metadata:
        labels:
          elastic: elastic
      spec:
        serviceAccountName: elastic
        initContainers:
        - name: sysctl
          securityContext:
            privileged: true
          command: ['sh', '-c', 'sysctl -w vm.max_map_count=262144']
        containers:
        - name: elasticsearch
          securityContext:
            capabilities:
              add: ["SYS_CHROOT"]
          resources:
            limits:
              memory: 10Gi
              cpu: 1.7
          env:
          - name: INSTANCE_RAM
            value: 10G
          - name: ES_JAVA_OPTS
            value: "-Xms8g -Xmx8g"
    count: 4
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data  # ECK expects this claim name for the data volume
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2000Gi
