Elastic Agent won't enroll

Hi,
We are having difficulties with some agents not enrolling properly into Fleet, especially on Windows machines.
The setup seems to be OK, as we can enroll from some servers.
However, on other servers, when we launch the install command, the agent shows up in Fleet but stays on "updating".

When looking at the API Keys section in Kibana, we see that only one key has been created, whereas two keys are usually created when everything works (one key + key:default).

The logs don't say anything, as the agent believes everything is OK.
The only temporary workaround we have found is to enroll and then install, but that setup doesn't seem to cope well with reboots.
Thanks for the help !

@greggailly Are there any errors in the Elastic Agent logs locally on the machine?

@blaker Unfortunately no. The only thing we can see is that the "filebeat" part of the agent is trying to send data to localhost:9200, even though it should be our domain.
That said, we are not too surprised by that, as it is the default for the agent's logs until our policy tells the agent otherwise.

@greggailly Since it's a Windows machine, you should have logs in C:\Program Files\Elastic\Agent\elastic-agent.log as well as extra logs in C:\Program Files\Elastic\Agent\data\elastic-agent-${version}\logs.

Those can help understand what is going wrong on those machines.
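As a rough aid for going through those files, here is a minimal Python sketch that keeps only the error-level entries. It assumes the logs are newline-delimited JSON with a "log.level" field, like the agent log entries quoted in this thread. The function name `error_entries` and the sample lines are hypothetical and only for illustration.

```python
import json
from typing import Iterable, Iterator


def error_entries(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed log entries whose "log.level" is "error"."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip anything that is not a JSON document (plain-text output).
            continue
        if isinstance(entry, dict) and entry.get("log.level") == "error":
            yield entry


# Hypothetical sample lines, shaped like the agent's JSON log entries.
sample = [
    '{"log.level":"info","message":"Configuration changes detected"}',
    '{"log.level":"error","message":"Could not communicate with fleet-server"}',
    "not a json line",
]

errors = list(error_entries(sample))
for e in errors:
    print(e.get("@timestamp"), e.get("message"))
```

To run it against a real file, pass an open file handle instead of `sample`, for example `error_entries(open(path, encoding="utf-8"))`, where `path` points at one of the log files mentioned above.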

@blaker Ok so we did find the following error in the "extra" logs:

{"log.level":"error","@timestamp":"2021-08-23T16:30:51.343Z","log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},"message":"Could not communicate with fleet-server Checking API will retry, error: status code: 400, fleet-server returned an error: BadRequest","ecs.version":"1.6.0"}

Why are we getting this when using the install command, but not when enrolling and then installing?
Where could we find more info on this 400 on the fleet-server side? (It is running in a Docker container using the elastic-agent image.)
Thanks !

@blaker After reading this post (Elastic Agent 7.14 -- Strange bug during enrollment "Elastic fleet agent bug"), it seems we are experiencing a similar problem.
Furthermore, I can confirm that we are having this problem on Windows Server 2016/2019 but NOT on Windows Server 2012.
We have two machines behind the same network, one running Windows Server 2012 which enrolls properly and one running Windows Server 2016 which fails to enroll.
I hope this can help !
Cheers

@greggailly Did you set up the Fleet Server yourself, or is the Fleet Server running in cloud.elastic.co?

@blaker We set it up ourselves. It runs in Docker; here is the part of the docker-compose file concerning Fleet:

  fleet:
    image: docker.elastic.co/beats/elastic-agent:7.14
    container_name: fleet
    restart: unless-stopped
    ports:
      - "80:8220"
    networks:
      - elk
    depends_on:
      - elasticsearch
      - kibana
    hostname: docker-fleet-server
    environment:
      - FLEET_SERVER_ENABLE=1
      - FLEET_SERVER_INSECURE_HTTP=1
      - FLEET_SERVER_SERVICE_TOKEN=SERVICE_TOKEN
      - FLEET_SERVER_ELASTICSEARCH_HOST=https://ourdomain.com:443
      - FLEET_INSECURE=1
      - FLEET_SERVER_HOST=fleet

We just ran another test on another network:
macOS, Debian, and Windows 10 enroll OK.
Windows Server 2016 stays stuck on "updating". We also noticed that the agent logs the following entry every 10 seconds and never stops:

{"log.level":"info","@timestamp":"2021-08-24T06:39:53.472Z","log.origin":{"file.name":"application/periodic.go","file.line":79},"message":"Configuration changes detected","ecs.version":"1.6.0"}

@greggailly It seems that the Elastic Agent on that Windows Server 2016 machine is not running in Fleet mode. application/periodic.go is only used when running in stand-alone mode (i.e. not enrolled in Fleet).

Did the install command with enrollment say it worked correctly? Because that Elastic Agent is not communicating with Fleet Server.
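The check described above can be sketched in Python: count the log entries whose origin is application/periodic.go, which according to the explanation above only appear in stand-alone mode. This is only a sketch. It assumes newline-delimited JSON logs with a nested "log.origin" object, as in the entries quoted in this thread, and the name `count_standalone_entries` is hypothetical.

```python
import json
from typing import Iterable

# Source file that, per the explanation above, only logs in stand-alone mode.
STANDALONE_ORIGIN = "application/periodic.go"


def count_standalone_entries(lines: Iterable[str]) -> int:
    """Count log entries emitted from application/periodic.go."""
    count = 0
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore non-JSON lines
        if not isinstance(entry, dict):
            continue
        origin = entry.get("log.origin")
        if isinstance(origin, dict) and origin.get("file.name") == STANDALONE_ORIGIN:
            count += 1
    return count


# The two log entries quoted in this thread, used as sample input.
sample = [
    '{"log.level":"error","@timestamp":"2021-08-23T16:30:51.343Z",'
    '"log.origin":{"file.name":"fleet/fleet_gateway.go","file.line":205},'
    '"message":"Could not communicate with fleet-server","ecs.version":"1.6.0"}',
    '{"log.level":"info","@timestamp":"2021-08-24T06:39:53.472Z",'
    '"log.origin":{"file.name":"application/periodic.go","file.line":79},'
    '"message":"Configuration changes detected","ecs.version":"1.6.0"}',
]

standalone_count = count_standalone_entries(sample)
print("entries from periodic.go:", standalone_count)
```

A count that keeps growing after enrollment would suggest the agent is still in stand-alone mode rather than talking to Fleet Server.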

@blaker Yes, no errors on the install command:

2021-08-25T21:58:48.262+0200    WARN    [tls]   tlscommon/tls_config.go:98      SSL/TLS verifications disabled.
2021-08-25T21:58:48.804+0200    INFO    cmd/enroll_cmd.go:414   Starting enrollment to URL: http://ourdomain.com:80/
2021-08-25T21:58:51.684+0200    INFO    cmd/enroll_cmd.go:250   Elastic Agent might not be running; unable to trigger restart
2021-08-25T21:58:51.684+0200    INFO    cmd/enroll_cmd.go:252   Successfully triggered restart on running Elastic Agent.

After a few more tests we came to the following conclusion:
When installing (with enrollment), the agent does not start in Fleet mode at first (it uses periodic.go and its status stays at "updating" in Fleet).
However, after manually restarting the service, the agent finally seems to communicate with Fleet.
This is rather good news for the moment, but we are still curious to know why this is a problem only on Windows Server 2016/2019.

Yes, it is strange that you see this behavior on Windows. I recently tested Elastic Agent on Windows Server 2019 Datacenter in GCP with much success, so I am surprised you are seeing this behavior.

Based on your last message, it seems that the enrollment command cannot find the running daemon, per the message "Elastic Agent might not be running; unable to trigger restart". That is why it is not restarting the service. The open question is why it cannot find it.

Are you using the standard install command provided with Elastic Agent? Or are you doing something custom to install the Elastic Agent on the Windows host?

Yes, we are using the standard install command.
For the moment, we manually restart the service to ensure it connects to Kibana correctly.
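For anyone who wants to script this workaround, here is a minimal Python sketch. It looks for the "unable to trigger restart" message in the install output and, if present, restarts the Windows service through PowerShell's Restart-Service cmdlet. The service name "Elastic Agent" is an assumption, as are all function names; check the actual service name on your machine before using anything like this.

```python
import subprocess
from typing import Callable, List

# Assumption: the Windows service is registered as "Elastic Agent".
SERVICE_NAME = "Elastic Agent"
# Message seen in the install output quoted in this thread.
RESTART_FAILED_MARKER = "unable to trigger restart"


def needs_manual_restart(install_output: str) -> bool:
    """Return True if the install output says no restart was triggered."""
    return RESTART_FAILED_MARKER in install_output


def restart_command(service_name: str = SERVICE_NAME) -> List[str]:
    """Build a PowerShell invocation that restarts the given service."""
    return [
        "powershell",
        "-NoProfile",
        "-Command",
        f"Restart-Service -Name '{service_name}'",
    ]


def restart_if_needed(
    install_output: str,
    run: Callable[..., object] = subprocess.run,
) -> bool:
    """Restart the service only when the install output calls for it."""
    if not needs_manual_restart(install_output):
        return False
    # check=True raises CalledProcessError if PowerShell reports a failure.
    run(restart_command(), check=True)
    return True
```

The `run` parameter exists only so the restart can be replaced with a stub when trying the logic on a machine without the service. On the affected servers, calling `restart_if_needed(output)` from an elevated session would use `subprocess.run` directly.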