Fleet Server Fails At Start, No Reason Given

Hello. I've been looking around to see if anyone else has experienced a similar issue, but I haven't found anything. I've set up an Elasticsearch cluster and Kibana, all using SSL certificates created following the Elastic Stack Basic Security Guide. I'm now attempting to install and enroll a Fleet Server on the same machine as my Kibana instance. After following the steps here, I end up with the following command to run:

sudo ./elastic-agent install -f \
  --url=https://34.xxx.xxx.xxx:8220 \
  --fleet-server-es=https://3.xxx.xxx.xxx:9200 \
  --fleet-server-service-token=<NEWLY_GENERATED_SERVICE_TOKEN> \
  --fleet-server-policy=<DEFAULT_POLICY_WITH_FLEET_SERVER_INTEGRATION> \
  --fleet-server-es-ca=/etc/kibana/elasticsearch-ca.pem \
  --certificate-authorities=/etc/kibana/fleet-server-certs/fleet-ca.crt \
  --fleet-server-cert=/etc/kibana/fleet-server-certs/fleet-server.crt \
  --fleet-server-cert-key=/etc/kibana/fleet-server-certs/fleet-server.key

I've tried numerous other versions of the same command including the quick start version which creates self-signed certificates. However, I always get exactly the same error code with no additional information on what's wrong.

YYYY-MM-DDTHH:MM:SS.sssZ  INFO  cmd/enroll_cmd.go:776  Fleet Server - Starting
Error: fleet-server failed: context canceled
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/7.17/fleet-troubleshooting.html
Error: enroll command failed with exit code: 1
For help, please see our troubleshooting guide at https://www.elastic.co/guide/en/fleet/7.17/fleet-troubleshooting.html

As I get no additional information, I have no idea what's wrong. Every other post I've found about Fleet Server startup issues has a more specific error, or at the very least has additional log entries before the failure. I only have one: "Fleet Server - Starting". Note that I've already checked the Kibana and Elastic Agent versions. What issues could be causing the Fleet Server installation to error right at the start of the process? Are there logs I could examine to find the issue? Any help is greatly appreciated!

EDIT: I've recreated the fleet server certs to see if that would fix the issue. I included -ip 34.xxx.xxx.xxx (public ip),10.xxx.xxx.xxx (private ip),0.0.0.0 in the cert, with no -dns arg since I'm just using host names for this initial deployment. I also re-downloaded the agent and placed it in /opt/ along with all the necessary certs. My command looks pretty much the same. Note that I changed the ES IP to another ES host for an unrelated reason.

sudo ./elastic-agent install --url=https://34.xxx.xxx.xxx:8220 \
  --fleet-server-es=https://54.xxx.xxx.xxx:9200 \
  --fleet-server-service-token=AAEAAWVsYXN0aWMvZmxlZXQtc2VydmVyL3Rva2VuLTE2NDY0MTYxOTYxMTk6bG9OYlFFdWpULXlFX0h5ek81MmZzZw \
  --fleet-server-policy=499b5aa7-d214-5b5d-838b-3cd76469844e \
  --certificate-authorities=/opt/elastic-agent-7.17.0-linux-x86_64/ca.crt \
  --fleet-server-es-ca=/opt/elastic-agent-7.17.0-linux-x86_64/elasticsearch-ca.pem \
  --fleet-server-cert=/opt/elastic-agent-7.17.0-linux-x86_64/fleet-server.crt \
  --fleet-server-cert-key=/opt/elastic-agent-7.17.0-linux-x86_64/fleet-server.key

I still get the same error with no additional information.
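For anyone checking the same thing: one way to confirm that a recreated cert really carries the expected IP entries is to print its Subject Alternative Name section with openssl. This is only a minimal sketch. The function name show_sans is made up for illustration, and it assumes openssl is installed and the cert is PEM encoded.

```shell
# Print the Subject Alternative Name entries of a PEM certificate.
# show_sans is a hypothetical helper name, not part of elastic-agent.
show_sans() {
  # -text dumps the decoded certificate; grep keeps the SAN header
  # plus the line after it, which lists the IP/DNS entries.
  openssl x509 -in "$1" -noout -text | grep -A1 'Subject Alternative Name'
}

# Example (path taken from the command above):
#   show_sans /opt/elastic-agent-7.17.0-linux-x86_64/fleet-server.crt
```

If the address used in --url does not show up in that output, a TLS hostname mismatch would be one thing to rule out.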

EDIT 2: I installed the agent first before running the command to see if that would help, and this time I did get additional information.

$~ ./elastic-agent install -f

$~ ./elastic-agent enroll -f <previous args>

Response:

YYYY-MM-DDTHH:MM:SS.sssZ        INFO    cmd/enroll_cmd.go:571   Spawning Elastic Agent daemon as a subprocess to complete bootstrap process.
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    application/application.go:67   Detecting execution mode
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    application/application.go:88   Agent is in Fleet Server bootstrap mode
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    [api]   api/server.go:62        Starting stats endpoint
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    application/fleet_server_bootstrap.go:130       Agent is starting
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    [api]   api/server.go:64        Metrics endpoint listening on: /opt/elastic-agent-7.17.0-linux-x86_64/data/tmp/elastic-agent.sock (configured: unix:///opt/elastic-agent-7.17.0-linux-x86_64/data/tmp/elastic-agent.sock)
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    application/fleet_server_bootstrap.go:140       Agent is stopped
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    stateresolver/stateresolver.go:48       New State ID is iLJi9-Kz
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    stateresolver/stateresolver.go:49       Converging state requires execution of 1 step(s)
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    log/reporter.go:40      YYYY-MM-DDTHH:MM:SS.sssZ - message: Application: fleet-server--7.17.0[]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    stateresolver/stateresolver.go:66       Updating internal state
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    cmd/enroll_cmd.go:776   Fleet Server - Starting
YYYY-MM-DDTHH:MM:SS.sssZ        ERROR   status/reporter.go:236  Elastic Agent status changed to: 'error'
YYYY-MM-DDTHH:MM:SS.sssZ        ERROR   log/reporter.go:36      YYYY-MM-DDTHH:MM:SS.sssZ - message: Application: fleet-server--7.17.0[]: State changed to FAILED: Error - dial tcp 54.xxx.xxx.xxx:9200: i/o timeout - type: 'ERROR' - sub_type: 'FAILED'
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    status/reporter.go:236  Elastic Agent status changed to: 'online'
YYYY-MM-DDTHH:MM:SS.sssZ        INFO    log/reporter.go:40      YYYY-MM-DDTHH:MM:SS.sssZ - message: Application: fleet-server--7.17.0[]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
YYYY-MM-DDTHH:MM:SS.sssZ        ERROR   status/reporter.go:236  Elastic Agent status changed to: 'error'
YYYY-MM-DDTHH:MM:SS.sssZ        ERROR   log/reporter.go:36      YYYY-MM-DDTHH:MM:SS.sssZ - message: Application: fleet-server--7.17.0[]: State changed to FAILED: Error - dial tcp 54.xxx.xxx.xxx: i/o timeout - type: 'ERROR' - sub_type: 'FAILED'

It then continues attempting to restart but keeps getting this dial tcp error with the Elasticsearch node IP address.
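A "dial tcp ... i/o timeout" like the one in the log means the TCP connection itself never opened, so it can be reproduced without the agent at all. Below is a rough sketch of such a check, run from the Fleet Server host. The function name check_tcp is hypothetical, and it assumes bash and the coreutils timeout command are available.

```shell
# Return 0 if a TCP connection to host:port opens within the timeout
# (default 5 seconds), non-zero otherwise.
# check_tcp is a hypothetical helper name, not part of elastic-agent.
check_tcp() {
  local host="$1" port="$2" secs="${3:-5}"
  # bash's /dev/tcp pseudo-device opens a TCP socket; timeout(1) bounds the wait.
  # Inside the bash -c script, $0 is the host and $1 is the port.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example with a placeholder address (substitute the Elasticsearch node's IP):
#   check_tcp 10.xxx.xxx.xxx 9200 && echo reachable || echo "not reachable"
```

If the check times out for one address of the Elasticsearch node but succeeds for another, the problem is likely network reachability (routing or firewall rules) rather than certificates or tokens.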

This problem has been solved. It's difficult to say exactly what the problem was, but one important modification was changing the public IPs to private ones in the enroll command. Eventually I did successfully use the install command (not the enroll command) to get the fleet server up. I'm not sure if this changed anything, but I also changed the ES IP to the master node's IP. I have to say, I hope Elastic improves and expands the troubleshooting documentation specifically for setting up the Fleet Server (not just Elastic Agents). What made this so difficult was not being sure what was wrong. At some point I had forgotten to re-copy my certs into my elastic agent folder, and the installer didn't give me a "file not found" error or a "cert error", just failed check-ins.
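Since the installer did not complain about the missing cert files, a small pre-flight check before running the install command might catch that case early. This is just a sketch: check_readable is a made-up name, and the file list in the example simply mirrors the paths from my command.

```shell
# Report every path that is missing or unreadable on stderr;
# return non-zero if at least one such path was found.
# check_readable is a hypothetical helper name, not part of elastic-agent.
check_readable() {
  local rc=0 f
  for f in "$@"; do
    if [ ! -r "$f" ]; then
      echo "missing or unreadable: $f" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example, run from the agent directory before the install command:
#   check_readable ca.crt elasticsearch-ca.pem fleet-server.crt fleet-server.key \
#     && echo "all cert files present"
```

It only verifies that the files exist and can be read. It does not validate the certificates themselves.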
