I have a combination of symptoms that may be caused by multiple issues, but since I don't know for sure what the cause is, I'm compiling everything into one thread.
First, my ECK is self-managed. There are three things in my setup that differ from the base ECK Helm charts:
- I use Traefik for ingress and create IngressRoutes with TLS that terminates at the ingress. I use a ServersTransport with insecureSkipVerify so that the IngressRoute can route to ECK's self-signed certificates without throwing an error (a simplified sketch of these resources is included after this list).
- For the Kibana config, I use public URLs for the Elasticsearch host and the Fleet Server host:
```yaml
config:
  # Note that these are specific to the namespace into which this example is installed, and are
  # using `elastic-stack` as configured here and detailed in the README when installing:
  #
  # `helm install es-kb-quickstart elastic/eck-stack -n elastic-stack`
  #
  # If installed outside of the `elastic-stack` namespace, the following 2 lines need modification.
  xpack.fleet.agents.elasticsearch.hosts: ["https://elasticsearch.example.com"]
  xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server.example.com"]
  xpack.fleet.outputs:
    - id: fleet-default-output
      name: default
      type: elasticsearch
      hosts: [ "https://elasticsearch.example.com" ]
      # openssl x509 -fingerprint -sha256 -noout -in tls/kibana/elasticsearch-ca.pem (colons removed)
      is_default: true
      is_default_monitoring: true
      config:
        ssl:
          certificate_authorities:
            - |
              -----BEGIN CERTIFICATE-----
              ...
              -----END CERTIFICATE-----
      ca_trusted_fingerprint: "<my fingerprint>"
  # Verify a package exists by: https://epr.elastic.co/package/<package name>/<version>/
  # Example: https://epr.elastic.co/package/system/latest/
  xpack.fleet.packages:
    - name: system
      version: latest
    - name: elastic_agent
      version: latest
    - name: fleet_server
      version: latest
    - name: kubernetes
      version: latest
    - name: network_traffic
      version: latest
  xpack.fleet.agentPolicies:
    - name: Fleet Server on ECK policy
      id: fleet-server
      namespace: default
```
- Instead of using the ECK Helm chart's kind: "Agent", I use the YAML file from here. The reason is that, for Metricbeat to send data related to kube-system, the agents and the related ClusterRoles have to be deployed in the kube-system namespace rather than the elastic-stack namespace.
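To make the first point concrete, here is roughly what the Traefik resources look like. This is a simplified sketch: the resource names, the `websecure` entrypoint, and the `elasticsearch-es-http` service name (which depends on the Elasticsearch CR name) are placeholders for my setup, not the exact manifests.

```yaml
apiVersion: traefik.containo.us/v1alpha1
kind: ServersTransport
metadata:
  name: eck-insecure-transport
  namespace: elastic-stack
spec:
  # ECK's HTTP layer uses self-signed certs, so Traefik skips verification towards the pods
  insecureSkipVerify: true
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: elasticsearch
  namespace: elastic-stack
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`elasticsearch.example.com`)
      kind: Rule
      services:
        - name: elasticsearch-es-http   # ECK-created HTTP service for the cluster
          port: 9200
          serversTransport: eck-insecure-transport
  tls: {}   # TLS terminates at the ingress with Traefik's certificate
```

The IngressRoutes for Kibana and Fleet Server follow the same pattern.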
Now, everything works fine at first. The Elastic Agents that I deployed using the DaemonSet from the file above show as Healthy, and when I check the data streams, the agents are sending data as well. All good, right? But weirdly, I'm facing the following problems:
- The agent policies keep showing "Out of date", then increase their revision number, and then repeat this cycle indefinitely:
- The bigger problem: the agents just randomly stop sending data despite showing "Healthy":
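For context, this is roughly how I watch the revision counter climbing on its own (a sketch; the Kibana URL and the credentials are placeholders, and `fleet-server` is the policy id from my config above):

```sh
# Poll the Fleet API for the policy's current revision; it keeps incrementing by itself.
curl -sk -u elastic:$ELASTIC_PASSWORD \
  "https://kibana.example.com/api/fleet/agent_policies/fleet-server" \
  | jq '.item | {name, revision, updated_at}'
```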
I have tried increasing the CPU and memory of the Elastic Agents, and increasing the CPU and memory of the Elasticsearch hot nodes. Neither helps. The weirdest thing is that if I log into Kibana and keep looking at the Fleet page, the agents keep sending data. But the moment I leave Kibana alone, the agents for some reason go into a kind of hibernation and stop sending data.
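This is roughly how I check whether data is still arriving (a sketch; the credentials are placeholders for my setup):

```sh
# Fetch the newest document timestamp from the metrics data streams; when the agents
# "hibernate", this timestamp simply stops advancing even though Fleet shows them Healthy.
curl -sk -u elastic:$ELASTIC_PASSWORD \
  "https://elasticsearch.example.com/metrics-*/_search?size=1&sort=@timestamp:desc&_source=@timestamp" \
  | jq '.hits.hits[0]._source'
```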
What I think is happening:
- If I keep refreshing the Fleet page in Kibana, Kibana keeps updating the agents, which somehow makes them work properly for a little while. That would explain why the revision numbers of the policies keep increasing.
- If I don't refresh the Fleet page, Kibana doesn't do anything, and the agents for some reason go "Out of date" and stop sending data.
What I think is the root cause of the problem:
- Certificate authorities. I think Kibana keeps overwriting its own agent configuration by constantly pushing updates to the agents with two different CA bundles: one CA bundle that is meant to be used only with ECK's self-signed certs, and another CA bundle that comes from the Kibana config I set with the fingerprint and certificate.
I think this is the case because of what the agent logs say:
{"log.level":"error","@timestamp":"2023-03-31T04:43:02.497Z","message":"Failed to connect to backoff(elasticsearch(https://elasticsearch.example.com:443)): Get \"https://elasticsearch.example.com:443\": x509: certificate signed by unknown authority","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.origin":{"file.line":150,"file.name":"pipeline/client_worker.go"},"service.name":"metricbeat","ecs.version":"1.6.0","log.logger":"publisher_pipeline_output","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2023-03-31T04:43:02.497Z","message":"Attempting to reconnect to backoff(elasticsearch(https://elasticsearch.example.com:443)) with 11 reconnect attempt(s)","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.logger":"publisher_pipeline_output","log.origin":{"file.line":141,"file.name":"pipeline/client_worker.go"},"service.name":"metricbeat","ecs.version":"1.6.0","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2023-03-31T04:43:02.504Z","message":"Error dialing x509: certificate signed by unknown authority","component":{"binary":"metricbeat","dataset":"elastic_agent.metricbeat","id":"beat/metrics-monitoring","type":"beat/metrics"},"log":{"source":"beat/metrics-monitoring"},"log.logger":"esclientleg","log.origin":{"file.line":38,"file.name":"transport/logging.go"},"service.name":"metricbeat","network":"tcp","address":"elasticsearch.example.com:443","ecs.version":"1.6.0","ecs.version":"1.6.0"}
I get these CA errors and connection failures even though the agents show Healthy whenever I look at the Fleet page, and even though the agents send data whenever I'm logged into Kibana.
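To figure out which CA the agents should actually be trusting for the public URL, I check which certificate the ingress presents (a diagnostic sketch; the hostname is the same public URL from my config). Since TLS terminates at Traefik, this is the ingress certificate, not ECK's self-signed one:

```sh
# Print the issuer and SHA-256 fingerprint of the cert actually served at the public endpoint,
# so it can be compared against the CA / fingerprint configured in xpack.fleet.outputs.
openssl s_client -connect elasticsearch.example.com:443 \
  -servername elasticsearch.example.com </dev/null 2>/dev/null \
  | openssl x509 -noout -issuer -fingerprint -sha256
```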
Why I think the Kibana config is the problem and keeps overwriting itself:
I have tried deploying the agents both with a DaemonSet and with the Agent Helm chart, and I ran into the same issue either way.
My guess is that the cause is having both ca_trusted_fingerprint and ssl.certificate_authorities set in the Kibana config. So I tried deleting one or the other, but if I delete either of them, the agents fail to connect to Elasticsearch at all. If I have both set, the agents connect to Elasticsearch and send data, but then run into the problem above.
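For reference, the two variants I tried look roughly like this (trimmed to the relevant part of the output; the `# ...` lines stand for the fields unchanged from the full config above):

```yaml
# Variant 1: fingerprint only -> agents fail to connect
xpack.fleet.outputs:
  - id: fleet-default-output
    # ...same fields as above...
    ca_trusted_fingerprint: "<my fingerprint>"

# Variant 2: CA bundle only -> agents also fail to connect
xpack.fleet.outputs:
  - id: fleet-default-output
    # ...same fields as above...
    config:
      ssl:
        certificate_authorities:
          - |
            -----BEGIN CERTIFICATE-----
            ...
            -----END CERTIFICATE-----
```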
What am I doing wrong here? If the agents weren't sending data to begin with, then at the very least I would know I was doing something wrong, but in this case the agents successfully send data and show as Healthy, and then just stop working.