Unable to communicate with Fleet Server after Upgrade to 7.14

Hi,

After upgrading to 7.14 (Elasticsearch and the elastic-agent hosting fleet-server), I'm unable to start other elastic-agents on version 7.14.

2021-08-04T09:11:32.489+0200 ERROR fleet/fleet_gateway.go:205 Could not communicate with fleet-server Checking API will retry, error: fail to decode checkin response: unexpected EOF

I tried unenrolling and re-enrolling the agent, but without success.

Older 7.13.x agents can still connect to Fleet Server 7.14.

Do you have any idea what's wrong here?

kindly
Thomas

I assume you are running everything on your own infrastructure and not using Elastic Cloud?

For the Elastic Agent running fleet-server, could you share the full logs from Elastic Agent itself and the logs from fleet-server (data/elastic-agent-/logs/default/)?

In the Fleet UI, does the Elastic Agent with the Fleet Server show as healthy? How did you upgrade? Manually or through Fleet?

Hi ruflin,

You are right, it's an on-prem installation. I upgraded the elastic-agent rpm with the Linux package manager (apt).

In the Fleet UI the agent flips between Healthy and Offline. No host metrics are transferred to Elasticsearch.

The fleet-server log file looks like this:

{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","@timestamp":"2021-08-04T07:05:36.812Z","message":"received initial configuration starting Fleet Server"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","cfg":{"NumCounters":500000,"MaxCost":52428800,"ActionTTL":300000000000,"ApiKeyTTL":900000000000,"EnrollKeyTTL":60000000000,"ArtifactTTL":86400000000000,"ApiKeyJitter":300000000000},"@timestamp":"2021-08-04T07:05:36.812Z","message":"makeCache"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","status":"STARTING","@timestamp":"2021-08-04T07:05:36.813Z","message":"Starting"}
{"service.name":"fleet-server","service.name":"fleet-server","message":"Starting stats endpoint","log.level":"info","log.logger":"fleet-metrics.api","@timestamp":"2021-08-04T07:05:36.814Z"}
{"service.name":"fleet-server","service.name":"fleet-server","log.level":"info","log.logger":"fleet-metrics.api","message":"Metrics endpoint listening on: /var/lib/elastic-agent/data/tmp/default/fleet-server/fleet-server.sock (configured: unix:///var/lib/elastic-agent/data/tmp/default/fleet-server/fleet-server.sock)","@timestamp":"2021-08-04T07:05:36.814Z"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","name":"elasticsearch","uuid":"87La0hWBRe-rL1FsHB5qrw","vers":"7.14.0","@timestamp":"2021-08-04T07:05:36.826Z","message":"Cluster Info"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","opts":{"flushInterval":250,"flushThresholdCnt":2048,"flushThresholdSz":1048576,"maxPending":8,"blockQueueSz":32,"apikeyMaxParallel":120},"@timestamp":"2021-08-04T07:05:36.826Z","message":"Run bulker with options"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","fleet_version":"7.14.0","elasticsearch_version":"7.14.0","@timestamp":"2021-08-04T07:05:36.827Z","message":"versions are compatible"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","name":"elasticsearch","uuid":"87La0hWBRe-rL1FsHB5qrw","vers":"7.14.0","@timestamp":"2021-08-04T07:05:36.832Z","message":"Cluster Info"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","limits":{"Interval":1000000,"Burst":1000,"Max":0,"MaxBody":1048576},"long_poll_timeout":300000,"long_poll_timestamp":30000,"long_poll_jitter":30000,"@timestamp":"2021-08-04T07:05:36.832Z","message":"Checkin install limits"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","limits":{"Interval":10000000,"Burst":100,"Max":50,"MaxBody":524288},"@timestamp":"2021-08-04T07:05:36.832Z","message":"Enroller install limits"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","limits":{"Interval":5000000,"Burst":25,"Max":50,"MaxBody":0},"maxParallel":8,"@timestamp":"2021-08-04T07:05:36.832Z","message":"Artifact install limits"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","limits":{"Interval":10000000,"Burst":100,"Max":50,"MaxBody":2097152},"@timestamp":"2021-08-04T07:05:36.832Z","message":"Ack install limits"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","method":"GET","path":"/api/status","@timestamp":"2021-08-04T07:05:36.832Z","message":"Server install route"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","method":"POST","path":"/api/fleet/agents/:id","@timestamp":"2021-08-04T07:05:36.832Z","message":"Server install route"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","method":"POST","path":"/api/fleet/agents/:id/checkin","@timestamp":"2021-08-04T07:05:36.832Z","message":"Server install route"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","method":"POST","path":"/api/fleet/agents/:id/acks","@timestamp":"2021-08-04T07:05:36.832Z","message":"Server install route"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","method":"GET","path":"/api/fleet/artifacts/:id/:sha2","@timestamp":"2021-08-04T07:05:36.832Z","message":"Server install route"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","bind":"0.0.0.0:8220","rdTimeout":60000,"wrTimeout":600000,"@timestamp":"2021-08-04T07:05:36.832Z","message":"server listening"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","@timestamp":"2021-08-04T07:05:36.832Z","message":"server hard connection limiter disabled"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","index":".fleet-policies","ctx":"index monitor","@timestamp":"2021-08-04T07:05:36.833Z","message":"start"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy agent monitor","throttle":5,"@timestamp":"2021-08-04T07:05:36.833Z","message":"run policy monitor"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","index":".fleet-actions","ctx":"index monitor","@timestamp":"2021-08-04T07:05:36.833Z","message":"start"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","status":"HEALTHY","@timestamp":"2021-08-04T07:05:37.11Z","message":"Running on default policy with Fleet Server integration"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy leader manager","policy_id":"4b1288a0-f122-11eb-b092-4bd59be5dafa","coordinator":"v0","@timestamp":"2021-08-04T07:05:38.416Z","message":"starting coordinator for policy"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy leader manager","policy_id":"a3efe3e0-77ac-11eb-b2f5-cbacb66767fb","coordinator":"v0","@timestamp":"2021-08-04T07:05:38.416Z","message":"starting coordinator for policy"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy leader manager","policy_id":"742cae30-c438-11eb-9ca7-a7ee6e2b28e6","coordinator":"v0","@timestamp":"2021-08-04T07:05:38.416Z","message":"starting coordinator for policy"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy leader manager","policy_id":"5b5b6f40-7aaa-11eb-b2f5-cbacb66767fb","coordinator":"v0","@timestamp":"2021-08-04T07:05:38.416Z","message":"starting coordinator for policy"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy leader manager","policy_id":"e577bb40-8d63-11eb-8832-232d30289a82","coordinator":"v0","@timestamp":"2021-08-04T07:05:38.416Z","message":"starting coordinator for policy"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy agent monitor","policyId":"a3efe3e0-77ac-11eb-b2f5-cbacb66767fb","rev":19,"coord":1,"@timestamp":"2021-08-04T07:05:40.435Z","message":"new policy"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy agent monitor","policyId":"742cae30-c438-11eb-9ca7-a7ee6e2b28e6","orev":0,"nrev":11,"ocoord":0,"ncoord":1,"@timestamp":"2021-08-04T07:05:40.436Z","message":"policy revised"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy agent monitor","policyId":"742cae30-c438-11eb-9ca7-a7ee6e2b28e6","revisionIdx":11,"coordinatorIdx":1,"@timestamp":"2021-08-04T07:05:40.436Z","message":"no pending subscriptions to revised policy"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy agent monitor","policyId":"5b5b6f40-7aaa-11eb-b2f5-cbacb66767fb","rev":15,"coord":1,"@timestamp":"2021-08-04T07:05:40.437Z","message":"new policy"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy agent monitor","policyId":"e577bb40-8d63-11eb-8832-232d30289a82","rev":16,"coord":1,"@timestamp":"2021-08-04T07:05:40.438Z","message":"new policy"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","ctx":"policy agent monitor","policyId":"4b1288a0-f122-11eb-b092-4bd59be5dafa","rev":2,"coord":1,"@timestamp":"2021-08-04T07:05:40.438Z","message":"new policy"}
{"log.level":"info","service.name":"fleet-server","service.name":"fleet-server","status":"HEALTHY","@timestamp":"2021-08-04T07:05:42.384Z","message":"Running on default policy with Fleet Server integration"}

The fleet-server log looks OK. Is the following error message from the Elastic Agent that is running fleet-server, or from one that is trying to enroll into it?

2021-08-04T09:11:32.489+0200 ERROR fleet/fleet_gateway.go:205 Could not communicate with fleet-server Checking API will retry, error: fail to decode checkin response: unexpected EOF

What fleet-server URL do you have set in the Fleet UI? Do you use any certificates? Any chance you could also share the following two logs:

  • Logs from the Elastic Agent 7.14 that tries to connect to the Elastic Agent running fleet-server
  • Logs from the Elastic Agent 7.14 that is running fleet-server

Can you also share your enrollment command?
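
For reference, one quick way to check whether a 7.14 agent host can reach fleet-server at all is to hit the status route directly (port 8220 and the -k flag are assumptions based on a default self-signed setup):

curl -k https://<fleet-server-host>:8220/api/status

If that returns a status response, TLS and basic HTTP connectivity are fine and the problem is further along in the checkin handling.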

What bugs me is that you mention 7.13 is just fine and only 7.14 Elastic Agents have issues enrolling. @blaker Any ideas on your end on what could cause this?

I cannot see any issue in either log file:

elastic-agent connecting to fleet

2021-08-04T08:34:05.469+0200    INFO    application/application.go:66   Detecting execution mode
2021-08-04T08:34:05.470+0200    INFO    application/application.go:91   Agent is managed by Fleet
2021-08-04T08:34:05.470+0200    INFO    capabilities/capabilities.go:59 capabilities file not found in /etc/elastic-agent/capabilities.yml
2021-08-04T08:34:05.470+0200    WARN    [tls]   tlscommon/tls_config.go:98      SSL/TLS verifications disabled.
2021-08-04T08:34:06.907+0200    INFO    [composable.providers.docker]   docker/docker.go:43     Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2021-08-04T08:34:06.907+0200    INFO    [api]   api/server.go:62        Starting stats endpoint
2021-08-04T08:34:06.908+0200    INFO    application/managed_mode.go:291 Agent is starting
2021-08-04T08:34:06.908+0200    INFO    [api]   api/server.go:64        Metrics endpoint listening on: /var/lib/elastic-agent/data/tmp/elastic-agent.sock (configured: unix:///var/lib/elastic-agent/data/tmp/elastic-agent.sock)
2021-08-04T08:34:07.008+0200    WARN    application/managed_mode.go:304 failed to ack update open /var/lib/elastic-agent/data/.update-marker: no such file or directory
2021-08-04T08:34:07.315+0200    WARN    [tls]   tlscommon/tls_config.go:98      SSL/TLS verifications disabled.
2021-08-04T08:34:08.387+0200    ERROR   fleet/fleet_gateway.go:205      Could not communicate with fleet-server Checking API will retry, error: fail to decode checkin response: unexpected EOF
2021-08-04T08:35:28.849+0200    ERROR   fleet/fleet_gateway.go:205      Could not communicate with fleet-server Checking API will retry, error: fail to decode checkin response: unexpected EOF
2021-08-04T08:39:15.153+0200    ERROR   fleet/fleet_gateway.go:205      Could not communicate with fleet-server Checking API will retry, error: fail to decode checkin response: unexpected EOF

elastic-agent hosting fleet

2021-08-04T09:05:35.803+0200    INFO    application/application.go:66   Detecting execution mode
2021-08-04T09:05:35.804+0200    INFO    application/application.go:91   Agent is managed by Fleet
2021-08-04T09:05:35.804+0200    INFO    capabilities/capabilities.go:59 capabilities file not found in /etc/elastic-agent/capabilities.yml
2021-08-04T09:05:35.968+0200    INFO    [composable.providers.docker]   docker/docker.go:43     Docker provider skipped, unable to connect: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
2021-08-04T09:05:35.969+0200    INFO    [api]   api/server.go:62        Starting stats endpoint
2021-08-04T09:05:35.969+0200    INFO    application/managed_mode.go:291 Agent is starting
2021-08-04T09:05:35.969+0200    INFO    [api]   api/server.go:64        Metrics endpoint listening on: /var/lib/elastic-agent/data/tmp/elastic-agent.sock (configured: unix:///var/lib/elastic-agent/data/tmp/elastic-agent.sock)
2021-08-04T09:05:36.070+0200    WARN    application/managed_mode.go:304 failed to ack update open /var/lib/elastic-agent/data/.update-marker: no such file or directory
2021-08-04T09:05:36.075+0200    INFO    stateresolver/stateresolver.go:48       New State ID is U6_BZ2OW
2021-08-04T09:05:36.075+0200    INFO    stateresolver/stateresolver.go:49       Converging state requires execution of 2 step(s)
2021-08-04T09:05:36.144+0200    INFO    operation/operator.go:259       operation 'operation-install' skipped for fleet-server.7.14.0
2021-08-04T09:05:36.262+0200    INFO    log/reporter.go:40      2021-08-04T09:05:36+02:00 - message: Application: fleet-server--7.14.0[18d91be7-536a-4561-9898-8d3cd37f0ab1]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
2021-08-04T09:05:36.576+0200    INFO    operation/operator.go:259       operation 'operation-install' skipped for filebeat.7.14.0
2021-08-04T09:05:36.972+0200    INFO    log/reporter.go:40      2021-08-04T09:05:36+02:00 - message: Application: filebeat--7.14.0--36643631373035623733363936343635[18d91be7-536a-4561-9898-8d3cd37f0ab1]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
2021-08-04T09:05:37.309+0200    INFO    log/reporter.go:40      2021-08-04T09:05:37+02:00 - message: Application: fleet-server--7.14.0[18d91be7-536a-4561-9898-8d3cd37f0ab1]: State changed to RUNNING: Running on default policy with Fleet Server integration - type: 'STATE' - sub_type: 'RUNNING'
2021-08-04T09:05:37.524+0200    INFO    operation/operator.go:259       operation 'operation-install' skipped for metricbeat.7.14.0
2021-08-04T09:05:37.668+0200    INFO    log/reporter.go:40      2021-08-04T09:05:37+02:00 - message: Application: metricbeat--7.14.0--36643631373035623733363936343635[18d91be7-536a-4561-9898-8d3cd37f0ab1]: State changed to STARTING: Starting - type: 'STATE' - sub_type: 'STARTING'
2021-08-04T09:05:37.670+0200    INFO    stateresolver/stateresolver.go:66       Updating internal state
2021-08-04T09:05:38.985+0200    INFO    log/reporter.go:40      2021-08-04T09:05:38+02:00 - message: Application: filebeat--7.14.0--36643631373035623733363936343635[18d91be7-536a-4561-9898-8d3cd37f0ab1]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'
2021-08-04T09:05:39.414+0200    INFO    log/reporter.go:40      2021-08-04T09:05:39+02:00 - message: Application: metricbeat--7.14.0--36643631373035623733363936343635[18d91be7-536a-4561-9898-8d3cd37f0ab1]: State changed to RUNNING: Running - type: 'STATE' - sub_type: 'RUNNING'

As the fleet-server URL I'm only using the IP address. Yes, I'm using self-signed certificates; that's why I have to use the --insecure option in the enrollment process:

elastic-agent enroll -f --url=https://xxx.xxx.xxx.xxx:8220 --enrollment-token=xxx --insecure

Here you can see the agent information (I just cut out the hostnames). The first entry has the issue.

Kindly
Thomas

Hi, I have the same issue, except that this was a fresh Elastic Cloud install of 7.14.

Agents were also on 7.14, but after purging and reinstalling Elastic Agent 7.13, data started flowing in.

@jaspher As you have this on Elastic Cloud, could you send me a private message with the cluster ID so I could have a look at the logs directly?

@thleh I'm stumbling over the following line:

2021-08-04T08:39:15.153+0200    ERROR   fleet/fleet_gateway.go:205      Could not communicate with fleet-server Checking API will retry, error: fail to decode checkin response: unexpected EOF

Do you have anything like a proxy in between?

Is a custom CA configured for the Elastic Agent?
--insecure only applies during the enrollment process, so maybe the resulting configuration is not valid.
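
For illustration only (the CA file path is a placeholder), enrolling against a fleet-server that uses a self-signed certificate while passing the CA explicitly, instead of relying on --insecure, would look something like this:

elastic-agent enroll -f --url=https://xxx.xxx.xxx.xxx:8220 --enrollment-token=xxx --certificate-authorities=/path/to/ca.crt

That way the agent should keep verifying the server certificate after enrollment as well.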

Hi Michal,

What do you mean by an invalid configuration? As far as I understand the error message, the problem occurs in the communication between fleet-server and the Elastic Agent.

The Fleet Server uses a self-signed certificate. That's why I used the --insecure option to register the new 7.14 Elastic Agent with the Fleet Server.

The registration process itself is successful.

@ruflin: There was no change in the infrastructure. On the host where 7.14 now fails, the 7.13.x agent worked fine before.

Thomas

We have the same issue in Elastic Cloud with both new 7.14 agents and agents that were upgraded to 7.14.

Thanks everyone for the patience on this and for sharing the cluster IDs to give us more hints on what is going on. We have found one issue that is likely related: Fleet: policy aren't assigned to agents (flaky) · Issue #27299 · elastic/beats · GitHub. The short story is that a communication body is closed too early, and because of this incomplete messages are sent to fleet-server. The problem happens much more often with larger policies, as these take more time to read. We have an open PR to get this fixed.

@thleh As you see this consistently, I'm curious whether you happen to have a pretty large policy?

Thanks for finding the root cause.

I can confirm this. My policy contains a lot of stuff (Endpoint Security, metrics, lots of log files) for those hosts. After assigning another (small) policy to that agent, it works fine.

Hopefully the fix will be included soon :slight_smile:

kindly
Thomas

Thanks for checking with a small policy and good to hear it works. We hope to have it fixed soon. Sorry for the inconvenience.

We are affected by the same issue; the only difference is that we have very small policies, but our network latency is higher (we use MPLS for some remote offices).
This is causing serious issues at the moment, as we can't reliably update or publish policies to our agents, and we rely on those for log collection.

Hi, my initial statement that it mainly affects large policies was unfortunately not fully correct. We also managed to reproduce it with small policies on our end. In the end it is a timing issue around when the system decides to clean things up. We are planning to ship a 7.14.1 with a fix, hopefully soonish.

@Alessandro_Berto There is one workaround, which I don't like at all: if you go back to 7.13, things should work properly again.
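
A rough sketch of that downgrade on an apt-based host (the exact 7.13.x version string is an assumption; check what your repository still carries, and re-enroll afterwards if needed):

sudo apt-get install --allow-downgrades elastic-agent=7.13.4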
