On-prem ECK cluster disconnects after a couple of hours and can't be reconnected

My company uses Elastic and I wanted to get more familiar with it, so I spun up a k3s cluster in my homelab. I've been trying to get it working for the last couple of weeks, but it keeps having issues.

My setup is as follows:
There are three machines in the cluster. Two of them have 64 cores, 15 TB of NVMe storage, and 512 GB of memory each, and those are the machines I'm deploying the Elasticsearch nodes to. k3s is running on all of the machines and they show up as a single cluster. I installed the latest ECK operator, and when I apply the manifest below the cluster builds without issue. I'm able to log in to Kibana, download http_ca.crt, and use that plus the elastic password to create indexes and load data with a Python script.
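For context, the connection side of my loader is roughly this (a simplified sketch, not the exact script; the host name, index name, and password handling are stand-ins):

import os

from elasticsearch import Elasticsearch

# "machine1" and port 32123 are placeholders for the NodePort Service defined below;
# ES_PASSWORD holds the elastic user's password from the ECK-generated secret.
es = Elasticsearch(
    "https://machine1:32123",
    ca_certs="http_ca.crt",                      # the CA cert downloaded from the cluster
    basic_auth=("elastic", os.environ["ES_PASSWORD"]),
)

print(es.info())                                 # quick check that TLS and auth work

if not es.indices.exists(index="test-index"):
    es.indices.create(index="test-index")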

The issue:
I'm loading 100 million documents (about 100 GB) in batches of 5,000. Each batch usually takes between 2 and 6 seconds and the load runs fine, until at some point the time per batch jumps to around 240 seconds; it then processes a couple more batches before the script gets a timeout error from Elasticsearch. Once this happens, if I restart the script it will either time out again after a few minutes, or it will load a couple of batches that each take hundreds of seconds before timing out yet again. Kibana still works to a degree, and the nodes and cluster health all show green. I say "to a degree" because after the initial timeout the response time in the Elastic dashboards climbs to tens of seconds, and sometimes I need to reload the page several times before it works.
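The loading loop itself is essentially a timed bulk loop like this sketch (the document contents and totals are placeholders; the client setup is the same as in the snippet above, just with an explicit request timeout):

import os
import time

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch(
    "https://machine1:32123",
    ca_certs="http_ca.crt",
    basic_auth=("elastic", os.environ["ES_PASSWORD"]),
    request_timeout=60,
)

BATCH_SIZE = 5000
TOTAL_DOCS = 100_000_000

def make_batch(start):
    # Placeholder documents; the real script reads from my source data instead.
    return [
        {"_index": "test-index", "_source": {"id": start + i, "value": f"doc-{start + i}"}}
        for i in range(BATCH_SIZE)
    ]

for n in range(0, TOTAL_DOCS, BATCH_SIZE):
    t0 = time.monotonic()
    bulk(es, make_batch(n))          # this is the call that eventually times out
    print(f"batch starting at {n}: {time.monotonic() - t0:.1f}s")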

The only way to get it working again is to destroy the cluster and start over, which is obviously not practical. Sometimes the timeouts start after 60 minutes, other times it takes up to 3.5 hours, but eventually the cluster stops responding.

I have looked at the logs and googled basically every WARN and ERROR message in them, and I still can't decipher what is causing this. I'm at a loss for what to do, and I really don't want to fall back to a single-node setup, considering I will ultimately need to import about 10 TB of data into the cluster. I'm looking for some help or guidance on what could be causing this issue.

Thanks in advance for any help :slight_smile: My Kubernetes YAML file is below.

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch-cluster
  namespace: elastic-stack
spec:
  version: 8.13.1
  nodeSets:
    - name: machine1-node-set
      count: 1
      config:
        node.store.allow_mmap: false
        xpack.monitoring.collection.enabled: true
        network.host: 0.0.0.0
        discovery.seed_hosts: ["elasticsearch-cluster-es-transport.elastic-stack.svc.cluster.local"]
        cluster.initial_master_nodes: ["elasticsearch-cluster-es-http"]
      podTemplate:
        metadata:
          labels:
            app: elasticsearch-cluster-es-node
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: kubernetes.io/hostname
                        operator: In
                        values:
                          - machine1
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: "12Gi"
                  cpu: "8000m"
                limits:
                  memory: "16Gi"
                  cpu: "12000m"
    - name: machine2-node-set
      count: 2
      config:
        node.store.allow_mmap: false
        xpack.monitoring.collection.enabled: true
        network.host: 0.0.0.0
        discovery.seed_hosts: ["elasticsearch-cluster-es-transport.elastic-stack.svc.cluster.local"]
      podTemplate:
        metadata:
          labels:
            app: elasticsearch-cluster-es-node
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                  - matchExpressions:
                      - key: kubernetes.io/hostname
                        operator: In
                        values:
                          - machine2
          containers:
            - name: elasticsearch
              resources:
                requests:
                  memory: "12Gi"
                  cpu: "8000m"
                limits:
                  memory: "16Gi"
                  cpu: "12000m"
---
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-cluster-es-http
  namespace: elastic-stack
  labels:
    common.k8s.elastic.co/type: elasticsearch
    elasticsearch.k8s.elastic.co/cluster-name: elasticsearch-cluster
spec:
  type: NodePort
  selector:
    app: elasticsearch-cluster-es-node
    common.k8s.elastic.co/type: elasticsearch
  ports:
    - name: https
      port: 9200
      protocol: TCP
      targetPort: 9200
      nodePort: 32123

---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
  namespace: elastic-stack
spec:
  version: 8.13.1
  count: 1
  elasticsearchRef:
    name: elasticsearch-cluster
  podTemplate:
    spec:
      containers:
        - name: kibana
          resources:
            requests:
              memory: "2Gi"
              cpu: "1000m"
            limits:
              memory: "4Gi"
              cpu: "4000m"

---
apiVersion: v1
kind: Service
metadata:
  name: kibana-kb-http
  namespace: elastic-stack
  labels:
    common.k8s.elastic.co/type: kibana
    kibana.k8s.elastic.co/name: kibana
spec:
  type: NodePort
  selector:
    common.k8s.elastic.co/type: kibana
    kibana.k8s.elastic.co/name: kibana
  ports:
    - name: https
      port: 5601
      protocol: TCP
      targetPort: 5601
      nodePort: 32700