Kibana no longer working

As of a few hours ago, Kibana is no longer working. It crashed suddenly, and now the Kibana pod is failing to reach a Ready state. The ES pods are still working fine.

The most obvious errors from the Kibana pods seem to relate to obtaining a license from Elasticsearch for X-Pack:

{"type":"log","@timestamp":"2019-11-18T23:08:21Z","tags":["warning","task_manager"],"pid":1,"message":"PollError Request Timeout after 30000ms"}
{"type":"log","@timestamp":"2019-11-18T23:08:40Z","tags":["license","warning","xpack"],"pid":1,"message":"License information from the X-Pack plugin could not be obtained from Elasticsearch for the [data] cluster. Error: Request Timeout after 30000ms"}

ES version: 7.2
ECK API version: v1alpha1

I'm not sure why this would occur suddenly. Any ideas on how to diagnose it?

Hi,

Could you check the connectivity to the Elasticsearch cluster? See https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-quickstart.html#k8s_request_elasticsearch_access for more information about how to do this.
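
For example, something along these lines, following the quickstart (it uses a cluster named quickstart; substitute your own cluster name):

# get the password for the elastic user
kubectl get secret quickstart-es-elastic-user -o go-template='{{.data.elastic | base64decode}}'
# forward the ES HTTP service to localhost
kubectl port-forward service/quickstart-es-http 9200
# in another terminal, using the password printed above; -k skips verification of the self-signed certificate
curl -u "elastic:PASSWORD" -k "https://localhost:9200/_cluster/health?pretty"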

No issues connecting to, or querying, Elasticsearch.

I'm still unsure what happened or how to fix it. The last thing I did was create a new index; I took a break, came back 30-60 minutes later, and Kibana could no longer connect to ES. :confounded:

Resolved by deploying a new ECK cluster and reindexing. I was due to update anyway.

Another update on the issue of a load balancer breaking the connection from Kibana to Elasticsearch. It wasn't actually resolved; I can only assume :ghost:s were to blame for it working briefly on the beta for an hour or so, and for months on the alpha, without the connection breaking. But I have found a solution that works.

The issue seems to be that the internal connection/DNS breaks when you expose the ES service as a LoadBalancer. The best solution I've found is to specify the following Kibana config settings in the Kibana manifest, which allows Kibana to connect through the load balancer, e.g.:

...
spec:
  version: 7.4.2
  count: 1
  elasticsearchRef:
    name: default
  config:
    elasticsearch.hosts: https://YOUR_DOMAIN_NAME
    elasticsearch.username: elastic
    elasticsearch.password: ELASTIC_PASSWORD|SECRET
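
As a side note, if you'd rather not keep the password in plain text in the manifest, here is a rough sketch using ECK secure settings to put it in the Kibana keystore instead (the secret name kibana-es-credentials is made up, and this assumes your ECK version supports secureSettings on the Kibana resource):

apiVersion: v1
kind: Secret
metadata:
  name: kibana-es-credentials   # hypothetical name
stringData:
  elasticsearch.password: MY_ES_PASSWORD
---
# ...then reference it from the Kibana spec instead of config.elasticsearch.password:
spec:
  secureSettings:
  - secretName: kibana-es-credentials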

What the fuck :s

I deployed a simple config change to ES, adding reindex.remote.whitelist: https://example:443, and Kibana then failed to connect.

I tried to recreate the cluster as before and it's not working. Kibana appears to be trying to connect on the internal DNS address, ignoring the config settings above.

In addition, the ES cluster health never gets to 'green', only 'unknown'. Troubleshooting and looking at the ES logs, there are no errors or suspicious entries on the ES pods; they look fine.

What the god damn fuck?

Hi @getorca,

Can you share your entire YAML manifests?
If it helps, you can also create your own LoadBalancer service targeting the Elasticsearch Pods, and keep the default one managed by ECK "internal", so that Kibana uses the internal one.

The YAMLs below use the DigitalOcean load balancer annotations. I'm attempting to run them on DigitalOcean managed Kubernetes, Kubernetes version 1.16.2.

Elasticsearch:

apiVersion: elasticsearch.k8s.elastic.co/v1beta1
kind: Elasticsearch
metadata:
  name: hugo
  annotations:
    service.beta.kubernetes.io/do-loadbalancer-protocol: "http"
    service.beta.kubernetes.io/do-loadbalancer-algorithm: "round_robin"
    service.beta.kubernetes.io/do-loadbalancer-tls-ports: "443"
    service.beta.kubernetes.io/do-loadbalancer-certificate-id: "MY_DO_CERT_ID"
    service.beta.kubernetes.io/do-loadbalancer-redirect-http-to-https: "true"
spec:
  version: 7.4.2
  nodeSets:
  - name: default
    count: 3
    podTemplate:
      spec:
        initContainers:
         - name: set-sysctl
           securityContext:
             privileged: true 
           command: 
           - sh
           - -c
           - |
             sysctl -w vm.max_map_count=262144
         - name: install-plugins
           command:
           - sh
           - -c
           - |
             bin/elasticsearch-plugin install --batch repository-s3
    config:
      node.master: true
      node.data: true
      node.ingest: true
      reindex.remote.whitelist:  example.com:443
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 250Gi
        storageClassName: do-block-storage
  updateStrategy:
    changeBudget:
      maxSurge: 3
      maxUnavailable: 1
  http:
    service:
      spec:
        type: LoadBalancer
        ports:
          - name: https
            protocol: TCP
            port: 443
            targetPort: 9200
          - name: http
            protocol: TCP
            port: 80
            targetPort: 9200

Kibana:

apiVersion: kibana.k8s.elastic.co/v1beta1
kind: Kibana
metadata:
  name: hugo
  annotations:
    service.beta.kubernetes.io/do-loadbalancer-protocol: "http"
    service.beta.kubernetes.io/do-loadbalancer-algorithm: "round_robin"
    service.beta.kubernetes.io/do-loadbalancer-tls-ports: "443"
    service.beta.kubernetes.io/do-loadbalancer-certificate-id: "MY_DO_CERT_ID"
    service.beta.kubernetes.io/do-loadbalancer-redirect-http-to-https: "true"
spec:
  version: 7.4.2
  count: 1
  elasticsearchRef:
    name: hugo
  config:
    elasticsearch.hosts: https://huge-es-01.example.com
    elasticsearch.username: elastic
    elasticsearch.password: MY_ES_PASSWORD
  http:
    service:
      spec:
        type: LoadBalancer
        ports:
          - name: https
            protocol: TCP
            port: 443
            targetPort: 5601

To recap the steps that led to the issue:

1 - Successfully deployed both ES and Kibana with the above YAMLs; everything worked, health was green.
2 - Updated the value of reindex.remote.whitelist to es.example.com:443. ES health stayed green, but the cluster got stuck at 2 ES nodes and Kibana could no longer connect to the ES cluster.
3 - Spun up a new DO Kubernetes cluster and tried to deploy with the above YAML files again. ES works but the health is "unknown"; all 3 pods look like they are working. Querying ES works. Kibana can't connect to the ES cluster. (See the health check below.)
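
For reference, the health I mean here is what the operator reports on the Elasticsearch resource, something like this (output illustrative; names assumed from the manifests above):

kubectl get elasticsearch
# NAME   HEALTH    NODES   VERSION   PHASE   AGE
# hugo   unknown   3       7.4.2     ...     ...
# HEALTH typically shows 'unknown' when the operator cannot reach the cluster over its internal HTTP service.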

Updated the value of reindex.remote.whitelist to es.example.com:443. ES health stayed green, stuck at 2 ES nodes

Looks like the rolling upgrade did not go well. If 1 of the 3 Pods is not available (probably the one being upgraded), you can look at its logs (the Elasticsearch logs) to see if anything is wrong with its configuration, and maybe learn more about the failing reindex from the remote cluster.
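
Something along these lines (the label selector and Pod names assume the cluster is called hugo with the default nodeSet, as in the manifests above):

kubectl get pods -l elasticsearch.k8s.elastic.co/cluster-name=hugo
# then inspect the Pod that is not Ready, e.g.:
kubectl logs hugo-es-default-2
kubectl describe pod hugo-es-default-2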

Do things work correctly if you unset the LoadBalancer type service?
We've seen other folks having issues with LoadBalancer services not being reachable using the internal DNS. A workaround is to create your own additional LoadBalancer service; see this example. Let me know what happens in your case!

There were no errors or suspicious logs from any of the ES pods. I will try to recreate it one more time.

Shouldn't setting the Kibana config elasticsearch.hosts, along with the username and password, to use the full DNS name and auth bypass the internal DNS?

I had tried that, but was getting a 504 Bad Gateway with both Kibana and Elasticsearch, so I went back to using the load balancer since it had worked for the past 3 months, and it deployed correctly the first time I set the host config value in Kibana. :s

@sebgl

I tried to deploy again on a brand-new cluster. All 3 Elasticsearch pods are running. There are no errors or unusual-looking logs on any of the ES pods. The ES cluster health is still "unknown". I can connect to and query ES with no problem.

The only error is from the ECK operator StatefulSet:

{"level":"error","@timestamp":"2019-11-26T18:12:11.759Z","logger":"controller-runtime.controller","message":"Reconciler error","ver":"1.0.0-beta1-84792e30","controller":"elasticsearch-controller","request":"default/hola","error":"unable to delete /_cluster/voting_config_exclusions: Delete https://hola-es-http.default.svc:9200/_cluster/voting_config_exclusions?wait_for_removal=false: dial tcp 10.245.223.51:9200: connect: connection timed out","errorCauses":[{"error":"unable to delete /_cluster/voting_config_exclusions: Delete https://hola-es-http.default.svc:9200/_cluster/voting_config_exclusions?wait_for_removal=false: dial tcp 10.245.223.51:9200: connect: connection timed out","errorVerbose":"Delete https://hola-es-http.default.svc:9200/_cluster/voting_config_exclusions?wait_for_removal=false: dial tcp 10.245.223.51:9200: connect: connection timed out\nunable to delete /_cluster/voting_config_exclusions\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client.(*clientV7).DeleteVotingConfigExclusions\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/client/v7.go:53\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2.ClearVotingConfigExclusions\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/version/zen2/voting_exclusions.go:78\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).reconcileNodeSpecs\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/nodes.go:92\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver.(*defaultDriver).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/driver/driver.go:234\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).internalReconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:284\ngithub.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch.(*ReconcileElasticsearch).Reconcile\n\t/go/src/github.com/elastic/cloud-on-k8s/pkg/controller/elasticsearch/elasticsearch_controller.go:219\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:216\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"}],"stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:218\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/
mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:192\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.2.1/pkg/internal/controller/controller.go:171\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190404173353-6a84e37a896d/pkg/util/wait/wait.go:88"}

So it looks like ES needs to connect to itself on the internal DNS. Is there a way to set ES to use an external DNS name?

@getorca ECK does need to connect to ES using the internal DNS anyway. So this needs to work.
I'd suggest again you keep the internal DNS and related ES & Kibana configuration default, so ECK can also connect to it.
Adding an additional LoadBalancer-type service should normally work as expected. Your 504 Bad Gateway probably comes from a wrong service configuration?

My mistake, it's a 502 Bad Gateway. My LB service YAML is as follows:

apiVersion: v1
kind: Service
metadata:
  name: es-loadbalancer
  annotations:
    service.beta.kubernetes.io/do-loadbalancer-protocol: "http"
    service.beta.kubernetes.io/do-loadbalancer-algorithm: "round_robin"
    service.beta.kubernetes.io/do-loadbalancer-certificate-id: "MY_CERT_ID"
    service.beta.kubernetes.io/do-loadbalancer-redirect-http-to-https: "true"
spec:
  externalTrafficPolicy: Local
  type: LoadBalancer
  ports:
  - name: https
    protocol: TCP
    port: 443
    targetPort: 9200
  selector:
    common.k8s.elastic.co/type: elasticsearch
    elasticsearch.k8s.elastic.co/cluster-name: sample
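    # note: the cluster-name label must match the Elasticsearch resource's metadata.name (hugo in the manifests above)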

It seems like I have everything working with the above LoadBalancer service and by disabling TLS in the Elasticsearch YAML, as @sebgl described in option 3 of "Public SSL'ed access with Ingress not working"; I guess that option is now available (https://www.elastic.co/guide/en/cloud-on-k8s/current/k8s-accessing-elastic-services.html#k8s-disable-tls).
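
For reference, the TLS-disabling bit from the linked docs looks roughly like this in the Elasticsearch spec (the load balancer then terminates TLS instead of the self-signed certificate on the HTTP layer):

spec:
  http:
    tls:
      selfSignedCertificate:
        disabled: true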