Single Instance Quickstart Cluster Crashes after 10 Minutes With ECK 0.8.1

I've been having some difficulty following the ECK Quickstart.

To begin with, I install the operator and instantiate a single-node cluster:

[root@a0002-flexnet ~]# kubectl apply -f https://download.elastic.co/downloads/eck/0.8.1/all-in-one.yaml
[SNIP]
[root@a0002-flexnet ~]# cat <<EOF | kubectl apply -f -
> apiVersion: elasticsearch.k8s.elastic.co/v1alpha1
> kind: Elasticsearch
> metadata:
>   name: quickstart
> spec:
>   version: 7.1.0
>   nodes:
>   - nodeCount: 1
>     config:
>       node.master: true
>       node.data: true
>       node.ingest: true
> EOF
elasticsearch.elasticsearch.k8s.elastic.co/quickstart created

[root@a0002-flexnet ~]# kubectl get elasticsearches.elasticsearch.k8s.elastic.co
NAME         HEALTH   NODES   VERSION   PHASE     AGE
quickstart   red              7.1.0     Pending   35s

[root@a0002-flexnet ~]# kubectl get elasticsearches.elasticsearch.k8s.elastic.co
NAME         HEALTH   NODES   VERSION   PHASE         AGE
quickstart   green    1       7.1.0     Operational   81s

At this point I can open a connection to the quickstart-es service and authenticate correctly.
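
For reference, this is roughly how I'm connecting (the secret and service names are the ones from the 0.8.1 quickstart docs; adjust if yours differ):

PASSWORD=$(kubectl get secret quickstart-elastic-user -o=jsonpath='{.data.elastic}' | base64 --decode)
kubectl port-forward service/quickstart-es 9200 &
curl -u "elastic:$PASSWORD" -k "https://localhost:9200"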

So far so good!

A describe reveals that it looks healthy:

[root@a0002-flexnet ~]# kubectl describe elasticsearches.elasticsearch.k8s.elastic.co quickstart
Name:         quickstart
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"elasticsearch.k8s.elastic.co/v1alpha1","kind":"Elasticsearch","metadata":{"annotations":{},"name":"quickstart","namespace":...
API Version:  elasticsearch.k8s.elastic.co/v1alpha1
Kind:         Elasticsearch
Metadata:
  Creation Timestamp:  2019-07-17T20:21:03Z
  Finalizers:
    expectations.finalizers.elasticsearch.k8s.elastic.co
    observer.finalizers.elasticsearch.k8s.elastic.co
    secure-settings.finalizers.elasticsearch.k8s.elastic.co
    licenses.finalizers.elasticsearch.k8s.elastic.co
  Generation:        2
  Resource Version:  37578333
  Self Link:         /apis/elasticsearch.k8s.elastic.co/v1alpha1/namespaces/default/elasticsearches/quickstart
  UID:               661930de-a8d0-11e9-9f84-ac1f6b7678a2
Spec:
  Http:
    Service:
      Metadata:
      Spec:
    Tls:
  Nodes:
    Config:
      Node . Data:    true
      Node . Ingest:  true
      Node . Master:  true
    Node Count:       1
    Pod Template:
      Metadata:
        Creation Timestamp:  <nil>
      Spec:
        Containers:  <nil>
  Update Strategy:
  Version:  7.1.0
Status:
  Available Nodes:  1
  Cluster UUID:     GguK3wIwSAe_W2hWWIiVsg
  Health:           green
  Master Node:      quickstart-es-gbkkpdr7lm
  Phase:            Operational
  Service:          quickstart-es
  Zen Discovery:
    Minimum Master Nodes:  1
Events:
  Type    Reason       Age   From                      Message
  ----    ------       ----  ----                      -------
  Normal  Created      11m   elasticsearch-controller  Created pod quickstart-es-gbkkpdr7lm
  Normal  StateChange  10m   elasticsearch-controller  Master node is now quickstart-es-gbkkpdr7lm

Here's where things go off the rails a bit:

After about 10 minutes, the Elasticsearch pod falls over. The last lines of its log are:

{"type": "server", "timestamp": "2019-07-17T20:32:22,527+0000", "level": "INFO", "component": "o.e.x.m.p.NativeController", "cluster.name": "quickstart", "node.name": "quickstart-es-gbkkpdr7lm", "cluster.uuid": "GguK3wIwSAe_W2hWWIiVsg", "node.id": "pUETgQReRNuWE0mvNi6q-A",  "message": "Native controller process has stopped - no new native processes can be started"  }
{"level":"info","ts":1563395552.5241575,"logger":"process-manager","msg":"Update process state","action":"terminate","id":"es","state":"failed","pid":15}
{"level":"info","ts":1563395552.531126,"logger":"process-manager","msg":"HTTP server closed"}
{"level":"info","ts":1563395552.5324,"logger":"process-manager","msg":"Exit","reason":"process failed","code":-1}

Afterwards, the operator reports the cluster as degraded, and it never recovers:

[root@a0002-flexnet ~]# kubectl describe elasticsearches.elasticsearch.k8s.elastic.co quickstart
Name:         quickstart
Namespace:    default
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"elasticsearch.k8s.elastic.co/v1alpha1","kind":"Elasticsearch","metadata":{"annotations":{},"name":"quickstart","namespace":...
API Version:  elasticsearch.k8s.elastic.co/v1alpha1
Kind:         Elasticsearch
Metadata:
  Creation Timestamp:  2019-07-17T20:21:03Z
  Finalizers:
    expectations.finalizers.elasticsearch.k8s.elastic.co
    observer.finalizers.elasticsearch.k8s.elastic.co
    secure-settings.finalizers.elasticsearch.k8s.elastic.co
    licenses.finalizers.elasticsearch.k8s.elastic.co
  Generation:        2
  Resource Version:  37583256
  Self Link:         /apis/elasticsearch.k8s.elastic.co/v1alpha1/namespaces/default/elasticsearches/quickstart
  UID:               661930de-a8d0-11e9-9f84-ac1f6b7678a2
Spec:
  Http:
    Service:
      Metadata:
      Spec:
    Tls:
  Nodes:
    Config:
      Node . Data:    true
      Node . Ingest:  true
      Node . Master:  true
    Node Count:       1
    Pod Template:
      Metadata:
        Creation Timestamp:  <nil>
      Spec:
        Containers:  <nil>
  Update Strategy:
  Version:  7.1.0
Status:
  Cluster UUID:  GguK3wIwSAe_W2hWWIiVsg
  Health:        red
  Master Node:   quickstart-es-gbkkpdr7lm
  Phase:         Pending
  Service:       quickstart-es
  Zen Discovery:
    Minimum Master Nodes:  1
Events:
  Type     Reason       Age   From                      Message
  ----     ------       ----  ----                      -------
  Normal   Created      11m   elasticsearch-controller  Created pod quickstart-es-gbkkpdr7lm
  Normal   StateChange  10m   elasticsearch-controller  Master node is now quickstart-es-gbkkpdr7lm
  Warning  Unhealthy    5s    elasticsearch-controller  Elasticsearch cluster health degraded

After some time, the pod goes into Waiting: CrashLoopBackOff.
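
For completeness, these are the commands I'm using to watch it (nothing ECK-specific; pod name as above):

kubectl get pods
kubectl describe pod quickstart-es-gbkkpdr7lm
kubectl logs quickstart-es-gbkkpdr7lm --previous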

How do I start to troubleshoot this?
What would cause this single-instance test cluster to crash reliably after 10 minutes?

Many thanks!

-Z

Hard to say what is going on without knowing more about your environment (which OS, kernel version, etc.).

We are aware of issues on older CentOS versions.

One thing you can try, to find out whether your problem is Elasticsearch running out of resources, is to raise the memory and CPU limits a bit and see if it makes a difference:

  nodes:
  - nodeCount: 1
    config:
      node.master: true
      node.data: true
      node.ingest: true
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          resources:
            # raise the container limits so the JVM has more headroom
            limits:
              memory: 4Gi
              cpu: 2
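
If the pod is being OOM-killed, that should show up in the container's last state. Something along these lines (standard kubectl; pod name from your output, container name assumed to be "elasticsearch") would confirm it:

kubectl get pod quickstart-es-gbkkpdr7lm -o jsonpath='{.status.containerStatuses[?(@.name=="elasticsearch")].lastState.terminated.reason}'
# expect "OOMKilled" if the memory limit was hit
kubectl describe pod quickstart-es-gbkkpdr7lm | grep -A 5 'Last State'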

Digging deeper on the OS side, I agree this seems to be cgroup-related. From the kernel log on the node:

Jul 18 09:32:05 node23 kernel: Task in /kubepods/burstable/pod6ed88fb3-a903-11e9-864b-ac1f6b7678b0/5b3e9d96f95fc08969fe2690e3552310c91449e7a81152defe1673c4f9bc6af8 killed as a result of limit of /kubepods/burstable/pod6ed88fb3-a903-11e9-864b-ac1f6b7678b0
Jul 18 09:32:05 node23 kernel: memory: usage 2097152kB, limit 2097152kB, failcnt 600739
Jul 18 09:32:05 node23 kernel: memory+swap: usage 2097152kB, limit 9007199254740988kB, failcnt 0
Jul 18 09:32:05 node23 kernel: kmem: usage 879048kB, limit 9007199254740988kB, failcnt 0
Jul 18 09:32:05 node23 kernel: Memory cgroup stats for /kubepods/burstable/pod6ed88fb3-a903-11e9-864b-ac1f6b7678b0: cache:0KB rss:0KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
Jul 18 09:32:05 node23 kernel: Memory cgroup stats for /kubepods/burstable/pod6ed88fb3-a903-11e9-864b-ac1f6b7678b0/1fe6fe5df5fc423376671f88a8ce0b2efb78e039230f56bdc7d829f40f3ff6a3: cache:0KB rss:44KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:44KB inactive_file:0KB active_file:0KB unevictable:0KB
Jul 18 09:32:05 node23 kernel: Memory cgroup stats for /kubepods/burstable/pod6ed88fb3-a903-11e9-864b-ac1f6b7678b0/5b3e9d96f95fc08969fe2690e3552310c91449e7a81152defe1673c4f9bc6af8: cache:228KB rss:1217832KB rss_huge:1101824KB mapped_file:8KB swap:0KB inactive_anon:0KB active_anon:1217812KB inactive_file:152KB active_file:8KB unevictable:0KB
Jul 18 09:32:05 node23 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Jul 18 09:32:05 node23 kernel: [ 4625]     0  4625      253        1       3        0          -998 pause
Jul 18 09:32:05 node23 kernel: [10813]     0 10813    32437     2168      22        0           969 process-manager
Jul 18 09:32:05 node23 kernel: [10928]  1000 10928   927169   307557     670        0           969 java
Jul 18 09:32:05 node23 kernel: Memory cgroup out of memory: Kill process 11072 (elasticsearch[q) score 1556 or sacrifice child
Jul 18 09:32:05 node23 kernel: Killed process 10928 (java) total-vm:3708676kB, anon-rss:1208424kB, file-rss:21796kB, shmem-rss:0kB
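
To see what the kernel is actually enforcing, I'm also looking at the pod's memory cgroup directly on the node (paths assume the standard cgroup v1 layout; the pod UID is the one from the OOM log above):

CG=/sys/fs/cgroup/memory/kubepods/burstable/pod6ed88fb3-a903-11e9-864b-ac1f6b7678b0
cat $CG/memory.limit_in_bytes       # configured limit (2Gi here)
cat $CG/memory.max_usage_in_bytes   # high-water mark
cat $CG/memory.kmem.usage_in_bytes  # kernel memory charged to the cgroup
cat $CG/memory.failcnt              # number of times the limit was hit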

Indeed, this is CentOS 7.6.1810:

[root@node23 ~]# cat /etc/redhat-release
CentOS Linux release 7.6.1810 (Core)
[root@node23 ~]# uname -a
Linux node23 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux