Beats in elastic-agent reporting "failed to connect to backoff"

Apologies in advance if I ask ignorant questions, or don't provide all relevant info in my post. I'm not an expert sysadmin, or network admin, or security admin. I've inherited Elastic from a departed co-worker, and have minimal understanding of how it works based on our previous architecture of installing individual beats agents. Now I am trying to figure out how to use Fleet and the unified Elastic Agent.

I'm working with Elastic v7.17.

I've set up multiple Elasticsearch hosts in AWS on RHEL8 EC2 instances, plus I've installed kibana on another RHEL8 host. I have certificates on all the hosts, generated by a certificate authority that we established using AWS ACM.

Before my co-worker departed, we got as far as successfully installing the agent on these hosts to make them Fleet Servers. They all show up in Fleet Agents in kibana. On each host, elastic-agent status reports Healthy. However, no data is showing up in Data Streams.
The agent policy applied to the hosts includes the following integrations: fleet-server, auditd, system, linux, Endpoint Security.
When I look at the logs on the server, specifically
/opt/Elastic/Agent/data/elastic-agent-*/logs/default/metricbeat-json.log and filebeat-json.log, I find this message repeating over and over:

{"log.level":"error","@timestamp":"2022-03-11T21:46:58.160Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":154},"message":"Failed to connect to backoff(elasticsearch(http://localhost:9200)): Get \"http://localhost:9200\": EOF","service.name":"metricbeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-11T21:46:58.160Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":145},"message":"Attempting to reconnect to backoff(elasticsearch(http://localhost:9200)) with 1 reconnect attempt(s)","service.name":"metricbeat","ecs.version":"1.6.0"}

I'm sure I haven't provided enough information to diagnose the issue yet. Again I apologize for being a total newbie, and hope somebody will take pity on me. :pray:

I am now getting Endpoint Security data in data streams. The solution to that problem was to fix a typo in the ssl.certificate_authorities setting in Fleet Settings.

I am still experiencing the problem where no data is coming through from the auditd, system, and linux integrations.

Reiterating my scenario...
I'm using Elastic 7.17, self managed.
I've set up three Elasticsearch nodes on RHEL 8 in AWS EC2. Each of these has additionally been set up as a Fleet Server.
I've set up a fourth EC2 RHEL8 host for kibana.
All four hosts have certificates signed by a certificate authority we set up using AWS Certificate Manager (ACM).
We are using an AWS NLB for managing traffic, so the Fleet Settings are:
Fleet Server Hosts: https://:8220
Elasticsearch Hosts: https://:9200
On the NLB we have set up listeners for the two ports above. Each one is forwarding to a target group that is comprised of the three Elasticsearch nodes.
On the Elasticsearch EC2 instances a security group has been assigned with inbound rules for ports 8220, 9200, and 9300 all allowing TCP traffic from the VPC CIDR.
On the kibana EC2 instance a security group has been assigned with inbound rules for ports 5601 and 443 allowing https traffic from our application load balancer.

On the kibana instance in the Agent/data/elastic-agent-*/logs/default/filebeat-json.log file I see the following messages repeating:

{"log.level":"error","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":154},"message":"Failed to connect to backoff(elasticsearch(http://localhost:9200)): Get \"http://localhost:9200\": dial tcp [::1]:9200: connect: connection refused","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":145},"message":"Attempting to reconnect to backoff(elasticsearch(http://localhost:9200)) with 94 reconnect attempt(s)","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher","log.origin":{"file.name":"pipeline/retry.go","file.line":219},"message":"retryer: send unwait signal to consumer","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher","log.origin":{"file.name":"pipeline/retry.go","file.line":223},"message":"  done","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"esclientleg","log.origin":{"file.name":"transport/logging.go","file.line":37},"message":"Error dialing dial tcp [::1]:9200: connect: connection refused","service.name":"filebeat","network":"tcp","address":"localhost:9200","ecs.version":"1.6.0"}

On the Elasticsearch nodes in the same file I see these messages repeating:

{"log.level":"error","@timestamp":"2022-03-15T17:11:58.650Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":154},"message":"Failed to connect to backoff(elasticsearch(http://localhost:9200)): Get \"http://localhost:9200\": EOF","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-15T17:11:58.650Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":145},"message":"Attempting to reconnect to backoff(elasticsearch(http://localhost:9200)) with 113 reconnect attempt(s)","service.name":"filebeat","ecs.version":"1.6.0"}
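A quick way to confirm which output URL the beats are actually dialing is to pull it out of these JSON log lines. This is just a sketch: the log line below is abbreviated from the excerpt above, and the grep pattern is my own, not anything shipped with Elastic; against the live files you could run the same grep over metricbeat-json.log and filebeat-json.log.

```shell
# One of the repeating log lines from this thread, abbreviated to the relevant field
line='{"log.level":"error","message":"Failed to connect to backoff(elasticsearch(http://localhost:9200)): Get \"http://localhost:9200\": EOF"}'

# Extract the elasticsearch(...) output target the beat is trying to reach
echo "$line" | grep -oE 'elasticsearch\([^)]*\)' | head -n 1
# → elasticsearch(http://localhost:9200)
```

Here it confirms the beats are pointed at http://localhost:9200 rather than the https NLB address configured in Fleet Settings.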

Metricbeat and Filebeat are in a perpetual state of CONFIGURING; I've never seen this change to HEALTHY.

elastic-agent status
Status: HEALTHY
Message: (no message)
Applications:
  * metricbeat_monitoring  (CONFIGURING)
                           Updating configuration
  * endpoint-security      (HEALTHY)
                           Protecting with policy {bd328999-4957-44fd-9e57-75aad67d7302}
  * filebeat               (CONFIGURING)
                           Updating configuration
  * fleet-server           (HEALTHY)
                           Running on policy with Fleet Server integration: 499b5aa7-d214-5b5d-838b-3cd76469844e
  * metricbeat             (CONFIGURING)
                           Updating configuration
  * filebeat_monitoring    (CONFIGURING)
                           Updating configuration

This is the fleet.yml file on one of the Elasticsearch nodes:

agent:
  id: f2f35a6f-8bbb-4a74-8d49-97424926516b
  headers: {}
  logging.level: info
  monitoring.http:
    enabled: false
    host: ""
    port: 6791
fleet:
  access_api_key: <key>
  agent:
    id: ""
  enabled: true
  host: <nlb dns name from AWS>:8220
  protocol: https
  proxy_disable: true
  reporting:
    check_frequency_sec: 30
    threshold: 10000
  server:
    host: 0.0.0.0
    internal_port: 8221
    output:
      elasticsearch:
        hosts:
        - localhost:9200
        protocol: https
        proxy_disable: false
        proxy_headers: null
        service_token: <token>
        ssl:
          certificate_authorities:
          - /etc/elasticsearch/certs/chain_cert.crt
          renegotiation: never
          verification_mode: ""
    policy:
      id: 499b5aa7-d214-5b5d-838b-3cd76469844e
    port: 8220
    ssl:
      certificate: /etc/elasticsearch/certs/<name>.crt
      key: /etc/elasticsearch/certs/<name>.key
      renegotiation: never
      verification_mode: ""
  ssl:
    certificate_authorities:
    - /etc/elasticsearch/certs/chain_cert.crt
    renegotiation: never
    verification_mode: ""
  timeout: 10m0s

And this is the fleet.yml from the kibana host:

agent:
  id: 80f20b0c-aa72-401a-a034-1bb4ca2400f7
  headers: {}
  logging.level: info
  monitoring.http:
    enabled: false
    host: ""
    port: 6791
fleet:
  access_api_key: <key>
  agent:
    id: ""
  enabled: true
  host: <nlb dns name from AWS>:8220
  hosts:
  - https://<nlb dns name from AWS>:8220
  protocol: http
  reporting:
    check_frequency_sec: 30
    threshold: 10000
  ssl:
    certificate_authorities:
    - /etc/kibana/certs/chain_cert.crt
    renegotiation: never
    verification_mode: none
  timeout: 10m0s

I believe the same problem has been posted here, though that person is working with a Windows host.

I found a workaround, but am still working to find out why it was necessary.

As noted earlier, in the Elastic/Agent/data/elastic-agent-*/logs/default/filebeat-json.log file (and the metricbeat-json.log file) I was seeing these messages repeating:

{"log.level":"error","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":154},"message":"Failed to connect to backoff(elasticsearch(http://localhost:9200)): Get \"http://localhost:9200\": dial tcp [::1]:9200: connect: connection refused","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":145},"message":"Attempting to reconnect to backoff(elasticsearch(http://localhost:9200)) with 94 reconnect attempt(s)","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher","log.origin":{"file.name":"pipeline/retry.go","file.line":219},"message":"retryer: send unwait signal to consumer","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher","log.origin":{"file.name":"pipeline/retry.go","file.line":223},"message":"  done","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"esclientleg","log.origin":{"file.name":"transport/logging.go","file.line":37},"message":"Error dialing dial tcp [::1]:9200: connect: connection refused","service.name":"filebeat","network":"tcp","address":"localhost:9200","ecs.version":"1.6.0"}

In my Fleet Settings I had specified that the Elasticsearch hosts URL was https://:9200, but these messages show that it is trying to connect to http://localhost:9200.

I confirmed that the fleet.yml file was correct. The only thing in the elastic-agent.yml file was:

fleet:
  enabled: true

Therefore I would expect the filebeat and metricbeat YMLs to pick up the Fleet configuration.

Looking at the filebeat.yml file (at Elastic/Agent/data/elastic-agent-*/install/filebeat-7.17.0-linux-x86_64) I found this:

# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["localhost:9200"]

  # Protocol - either `http` (default) or `https`.
  #protocol: "https"

  # Authentication credentials - either API key or username/password.
  #api_key: "id:api_key"
  #username: "elastic"
  #password: "changeme"

I updated that file to this:

# ---------------------------- Elasticsearch Output ----------------------------
output.elasticsearch:
  # Array of hosts to connect to.
  hosts: ["<nlb dns name>:9200"]

  # Protocol - either `http` (default) or `https`.
  protocol: "https"

  # Authentication credentials - either API key or username/password.
  #api_key: "id:api_key"
  username: "elastic"
  password: "<actual password>"
  ssl.verification_mode: none

After that update and restarting the elastic-agent, data began appearing in the data streams for the OS & System integrations that I had added to the policy for that host.

So, the question is, why isn't filebeat picking up the configuration from the fleet settings?

So with versions before 8.1.0 I had to manually change the beats YML, which is not mentioned anywhere in the official guides.

Even after making the change and restarting the agent, it would send through some data and then stop working.

The elastic-agent would show CONFIGURING instead of HEALTHY.

I've wiped the installation and started from scratch on 8.1.0, and now it's finally working without changing the beats YML manually.

I haven't done anything different from before, when I was configuring Fleet using 7.x.x and 8.0.x, other than running the install command on an actual Windows Server 2016 host as opposed to a Windows 10 Enterprise VM.

PS C:\Program Files\Elastic\Agent> .\elastic-agent.exe status
Status: HEALTHY
Message: (no message)
Applications:
  * filebeat               (HEALTHY)
                           Running
  * metricbeat             (HEALTHY)
                           Running
  * filebeat_monitoring    (HEALTHY)
                           Running
  * metricbeat_monitoring  (HEALTHY)
                           Running
PS C:\Program Files\Elastic\Agent>

I'm running fleet server on Linux but planning to monitor Windows Server VMs

PS C:\Users\Pikmin\Downloads\elastic-agent-8.1.0-windows-x86_64> .\elastic-agent.exe install --url=https://192.168.131.155:8220 --enrollment-token=TOKENSTRING --certificate-authorities=C:\Users\Pikmin\Downloads\fleet-server.crt
Elastic Agent will be installed at C:\Program Files\Elastic\Agent and will run as a service. Do you want to continue? [Y/n]:
{"log.level":"info","@timestamp":"2022-03-16T20:53:13.571+1100","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":455},"message":"Starting enrollment to URL: https://192.168.131.155:8220/","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-16T20:53:14.459+1100","log.origin":{"file.name":"cmd/enroll_cmd.go","file.line":255},"message":"Successfully triggered restart on running Elastic Agent.","ecs.version":"1.6.0"}
Successfully enrolled the Elastic Agent.
Elastic Agent has been successfully installed.

Will probably add Linux servers at some point too, still in testing phase.

It would be nice if someone from Elastic could respond; I would prefer to know whether this is a bug in previous versions or something that we both missed during configuration (which now works for me in 8.1.0).

There was a typo in my post concerning the workaround:

That should say that in the Fleet Settings I had specified that the Elasticsearch hosts URL was https://nlb-dns-name-from-aws:9200

Closing the loop on this.

I worked with Elastic Support to find the issue. Eventually they pointed out that our settings in Fleet Settings > Elasticsearch Output Configuration (YAML) were wrong.

My former co-worker who had started the implementation work for Elastic Agent and Fleet Management had put into that setting:

username: elastic
password: ****
ssl.certificate_authorities: ['<path to cert chain>']

Before he left, he had told us that this information was required.

Elastic Support told me that the credentials should not be included, and that Fleet would manage the required API keys. So, I removed the credentials from the setting, leaving the ssl.certificate_authorities setting, and data began flowing for the integrations that leverage filebeat and metricbeat.
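For reference, after that change our Fleet Settings > Elasticsearch output configuration (YAML) was reduced to just the CA entry sketched below (path placeholder as elsewhere in this thread); Fleet provisions the API keys for enrolled agents on its own, so no credentials belong here:

```yaml
# Fleet Settings > Elasticsearch output configuration (YAML)
# Credentials removed: Fleet manages the API keys for enrolled agents itself
ssl.certificate_authorities: ['<path to cert chain>']
```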

When looking at the elastic-agent status output, it now shows each item in a HEALTHY state instead of stuck in CONFIGURING. And we no longer see the errors in the logs that I mentioned in my original post in this thread.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.