Fleet Server is unstable. Can't connect new hosts but status is 'healthy'

Hello,

Since the end of February I have not been able to add new hosts to my fleet using my Fleet Server. I also no longer receive data from my Windows domain controller, for some reason.
When I try to add another Windows host, it just says that the remote server 'is not ready to accept connections yet', over and over, with no success. I tried the same with an Ubuntu container and got the same result.
If I run curl -f http://fleet-server-ip:8220/api/status from the host, it sometimes does not respond at all, and sometimes it returns: {"name":"fleet-server","status":"HEALTHY"}.
Other hosts that were already added before the end of February are fine and 'healthy' (besides the above-mentioned domain controller), but I just can't add new ones.
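
Intermittent behavior like this can be quantified with a short polling loop; a minimal sketch, where the default FLEET_URL is a placeholder to be replaced with the real fleet-server address:

```shell
# Probe the Fleet Server status endpoint a few times and tally how often it
# answers; an unstable server shows a mix of successes and failures.
# FLEET_URL is a placeholder -- substitute your fleet-server address.
FLEET_URL="${FLEET_URL:-http://127.0.0.1:8220/api/status}"
ok=0; fail=0
for i in 1 2 3 4 5; do
  if curl -sf --max-time 2 "$FLEET_URL" >/dev/null 2>&1; then
    ok=$((ok + 1))
  else
    fail=$((fail + 1))
  fi
done
echo "healthy: $ok/5, failed: $fail/5"
```

Anything other than 5/5 healthy points at the server or the network path, not at the agent being enrolled.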

ELK and all agents are running version 8.0.0.

I would appreciate any help, thank you

Would you be able to show us the configuration of the Fleet Server from the UI?
(Navigate to Integrations --> Fleet Server --> Fleet Server settings.) You will see a Fleet Server entry that can be expanded; it will show you "Max Connections" and a YAML box for other config.

How many agents do you have?

thanks

You mean that?

I have 10 agents enrolled:

pFleet was just added recently by me, in an effort to use another fleet server, but this second fleet server doesn't work ("Error: failed to communicate with Elastic Agent daemon: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/elastic-agent.sock: connect: no such file or directory").

pMinecraft shows 'never checked in' (that's the Ubuntu container I tried to enroll), and the Windows server PVE-WINSRV-DC2 is the DC I talked about, which, as can be seen, has not sent logs since the end of February. The other Windows host I tried to add doesn't show up here, as I couldn't enroll it in the first place.

Yes, that page. The default config is satisfactory for your use here. Ignoring pFleet for now, what does the "elastic-agent status" command give you for PVE-WINSRV-DC2? (Execute it as a superuser on that host.)

What integrations have you installed on the default policy?

OK, did that. It seems 'HEALTHY':

This is the policy that I use for this DC (or all Windows hosts to be precise):

Do you think upgrading Endpoint Security would be worth it?
(I didn't want to risk it yet, as I have another host running this policy (a Windows host, not a server) that was working just fine last time I checked.)

Can't tell, to be honest. All the agents in the "Default policy (windows)" are either offline or unhealthy, but others in the Default policy are fine. What does Endpoint Security look like in the Default policy?

We are working on adding more status reporting on the integrations to help with diagnostics.

The Default policy is just the Windows policy minus the Windows-Logger. It also recommends upgrading Endpoint Security there. pMinecraft is in the Default policy too, but it's not fine (it's stuck at "Updating" and says it never 'checked in').

By the way, it seems like Fleet is not sending any data besides empty TCP packets when I request curl http://10.24.1.7:8220/api/status.

10.24.1.5 is Elasticsearch, 10.24.1.7 is Logstash and the Fleet Server, and 10.24.1.3 is the DC:

01:38:27.835299 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.7.8220 > 10.24.1.3.56625: Flags [S.], cksum 0x9f02 (correct), seq 748443443, ack 1512235924, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:38:28.835551 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.7.8220 > 10.24.1.3.56625: Flags [S.], cksum 0x9f02 (correct), seq 748443443, ack 1512235924, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:38:29.298891 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.5.9200 > 10.24.1.3.56626: Flags [S.], cksum 0x7399 (correct), seq 3175754204, ack 2289793402, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:38:29.731539 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.5.9200 > 10.24.1.3.56623: Flags [S.], cksum 0x39ca (correct), seq 773930180, ack 2040741478, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:38:30.307490 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.5.9200 > 10.24.1.3.56626: Flags [S.], cksum 0x7399 (correct), seq 3175754204, ack 2289793402, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:38:30.823353 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.7.8220 > 10.24.1.3.56625: Flags [S.], cksum 0x9f02 (correct), seq 748443443, ack 1512235924, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:38:31.015485 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.5.9200 > 10.24.1.3.56619: Flags [S.], cksum 0xf2fd (correct), seq 1619032043, ack 3708073038, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:38:32.298632 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.5.9200 > 10.24.1.3.56626: Flags [S.], cksum 0x7399 (correct), seq 3175754204, ack 2289793402, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:38:32.835490 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.7.8220 > 10.24.1.3.56625: Flags [S.], cksum 0x9f02 (correct), seq 748443443, ack 1512235924, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:38:34.307466 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.5.9200 > 10.24.1.3.56626: Flags [S.], cksum 0x7399 (correct), seq 3175754204, ack 2289793402, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:38:34.508482 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.5.9200 > 10.24.1.3.56627: Flags [S.], cksum 0x3295 (correct), seq 1381733823, ack 801968696, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
01:38:35.523462 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    10.24.1.5.9200 > 10.24.1.3.56627: Flags [S.], cksum 0x3295 (correct), seq 1381733823, ack 801968696, win 64240, options [mss 1460,nop,nop,sackOK,nop,wscale 7], length 0
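
Reading the capture: each line is a SYN-ACK (Flags [S.]) that 10.24.1.7:8220 and 10.24.1.5:9200 keep resending with identical seq/ack numbers, i.e. TCP handshakes that never complete because the DC's final ACK never arrives. A small helper (the capture filename is an assumption) to count those repeats in saved tcpdump text output:

```shell
# A SYN-ACK (Flags [S.]) re-sent with identical seq/ack numbers is a TCP
# retransmission: the server never saw the client's final ACK, so the
# three-way handshake never completes. This counts such repeats per
# (source, destination, seq) in saved tcpdump text output.
count_synack_retrans() {
  awk '/Flags \[S\.\]/ {
    for (i = 1; i <= NF; i++)
      if ($i == "seq") print $1, $3, $(i + 1)   # src, dst, seq number
  }' "$1" | sort | uniq -c | sort -rn
}
```

Usage: count_synack_retrans capture.txt — any count greater than 1 means the handshake to that peer is stalling, which suggests the ACKs are being lost or filtered somewhere between the agent and ports 8220/9200.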

Any update?

A friend of mine just installed the whole ELK stack from scratch and he has the same problem.

Update: I completely reinstalled the system on another machine (now v8.1.2). It still doesn't work!

@Nima_Rezainia We are still struggling with this problem. Can't enroll any agent.

What is the current state? Are you still running on http or https?

What are your Fleet Settings?

Fleet still reports it's healthy:

[root@pfleet ~]# curl 10.20.1.8:8220/api/status

{"name":"fleet-server","status":"HEALTHY"}

So for the Fleet Server you are using http, but for the Elasticsearch server you are using https?

Also, what is the command-line argument you are using to enroll?

@zx8086 I used plain http for everything in the first setup this post was about. For the setup I use now (which has the same issue) I use the default installation settings: https for Elasticsearch, but not for Fleet or Kibana. I use the command suggested by the Kibana UI for adding a host:
.\elastic-agent.exe install --url=http://10.20.1.8:8220 --enrollment-token=mytoken

@maof97

So both the Fleet and Elasticsearch servers are on http, and are set as such in the Fleet settings?

You can curl to both of them and get correct responses?

curl 10.20.1.8:8220/api/status
curl 10.20.1.9200/_cluster/health?pretty

Also, if you are using only http, you should use the --insecure flag.

Sometimes it's healthy, sometimes I get "Connection Refused". That's the problem here ...

I think you mean curl https://10.20.1.6:9200/_cluster/health?pretty because, as I stated before, Elasticsearch is using https and the rest http:

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 44,
  "active_shards" : 44,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 23,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 65.67164179104478
}

Yes I used the --insecure flag, forgot to copy that.
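
On a side note, the health output above is yellow purely because of the 23 unassigned shards; with a single data node those are usually replica shards that have no second node to be allocated to, and yellow (unlike red) means all primaries are active, so it is unlikely to be the enrollment problem. The reported percentage checks out:

```shell
# Sanity-check of the cluster health output above: 44 active shards out of
# 44 + 23 = 67 total gives the reported active_shards_percent_as_number.
active=44; unassigned=23
pct=$(awk -v a="$active" -v u="$unassigned" \
  'BEGIN { printf "%.2f", 100 * a / (a + u) }')
echo "$pct"   # 65.67
```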

This sounds more like the service itself (resources: memory, CPU) or infrastructure/connectivity.

Are you running a single client to rule out load?

If you can monitor the endpoint you should see more of the root cause, for example by putting it behind a reverse-proxy load balancer and monitoring the upstream.
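
Even without a reverse proxy, a timestamped probe makes it possible to correlate the intermittent "Connection Refused" windows with events in the fleet-server logs; a minimal sketch, where the default URL is a placeholder:

```shell
# Print a UTC timestamp plus the HTTP status of each probe; curl's
# %{http_code} comes out as 000 when the connection is refused or times
# out. URL is a placeholder -- substitute your fleet-server address.
URL="${URL:-http://127.0.0.1:8220/api/status}"
probe() {
  code=$(curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$URL")
  echo "$(date -u '+%Y-%m-%d %H:%M:%S') $code"
}
probe
```

Run probe from cron or a loop and diff the "000" windows against the fleet-server log to see whether the process is restarting or simply not accepting connections.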

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.