I am now getting Endpoint Security data in data streams. The solution to that problem was to fix a typo in the ssl.certificate_authorities setting in Fleet Settings.
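For reference, the corrected block in the advanced Elasticsearch output YAML under Fleet Settings now looks roughly like this (the path is the same CA chain file my agents reference in fleet.yml; adjust per environment):

ssl:
  certificate_authorities: ["/etc/elasticsearch/certs/chain_cert.crt"]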
I am still experiencing the problem where no data is coming through from the auditd, system, and linux integrations.
Reiterating my scenario...
I'm using Elastic 7.17, self-managed.
I've set up three Elasticsearch nodes on RHEL 8 in AWS EC2. Each of these has additionally been set up as a Fleet Server.
I've set up a fourth RHEL 8 EC2 host for Kibana.
All four hosts have certificates signed by a certificate authority we set up using AWS Certificate Manager.
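For what it's worth, a quick way to check a host certificate against that chain (paths taken from the fleet.yml files below, where <name> is the host-specific certificate) would be:

openssl verify -CAfile /etc/elasticsearch/certs/chain_cert.crt /etc/elasticsearch/certs/<name>.crt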
We are using an AWS NLB for managing traffic, so the Fleet Settings are:
Fleet Server Hosts: https://<nlb dns name from AWS>:8220
Elasticsearch Hosts: https://<nlb dns name from AWS>:9200
On the NLB we have set up listeners for the two ports above. Each one is forwarding to a target group that is comprised of the three Elasticsearch nodes.
On the Elasticsearch EC2 instances a security group has been assigned with inbound rules for ports 8220, 9200, and 9300 all allowing TCP traffic from the VPC CIDR.
On the kibana EC2 instance a security group has been assigned with inbound rules for ports 5601 and 443 allowing https traffic from our application load balancer.
On the kibana instance, in the Agent/data/elastic-agent-*/logs/default/filebeat-json.log file, I see the following messages repeating:
{"log.level":"error","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":154},"message":"Failed to connect to backoff(elasticsearch(http://localhost:9200)): Get \"http://localhost:9200\": dial tcp [::1]:9200: connect: connection refused","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":145},"message":"Attempting to reconnect to backoff(elasticsearch(http://localhost:9200)) with 94 reconnect attempt(s)","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher","log.origin":{"file.name":"pipeline/retry.go","file.line":219},"message":"retryer: send unwait signal to consumer","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"info","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"publisher","log.origin":{"file.name":"pipeline/retry.go","file.line":223},"message":" done","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2022-03-15T17:03:41.086Z","log.logger":"esclientleg","log.origin":{"file.name":"transport/logging.go","file.line":37},"message":"Error dialing dial tcp [::1]:9200: connect: connection refused","service.name":"filebeat","network":"tcp","address":"localhost:9200","ecs.version":"1.6.0"}
On the elasticsearch nodes in the same file I see these messages repeating:
{"log.level":"error","@timestamp":"2022-03-15T17:11:58.650Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":154},"message":"Failed to connect to backoff(elasticsearch(http://localhost:9200)): Get \"http://localhost:9200\": EOF","service.name":"filebeat","ecs.version":"1.6.0"}{"log.level":"info","@timestamp":"2022-03-15T17:11:58.650Z","log.logger":"publisher_pipeline_output","log.origin":{"file.name":"pipeline/output.go","file.line":145},"message":"Attempting to reconnect to backoff(elasticsearch(http://localhost:9200)) with 113reconnect attempt(s)","service.name":"filebeat","ecs.version":"1.6.0"}
Metricbeat and Filebeat are in a perpetual "configuring" state; I've never seen them change to "healthy".
elastic-agent status
Status: HEALTHY
Message: (no message)
Applications:
  * metricbeat_monitoring (CONFIGURING)
      Updating configuration
  * endpoint-security (HEALTHY)
      Protecting with policy {bd328999-4957-44fd-9e57-75aad67d7302}
  * filebeat (CONFIGURING)
      Updating configuration
  * fleet-server (HEALTHY)
      Running on policy with Fleet Server integration: 499b5aa7-d214-5b5d-838b-3cd76469844e
  * metricbeat (CONFIGURING)
      Updating configuration
  * filebeat_monitoring (CONFIGURING)
      Updating configuration
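To see exactly which output configuration the agent is handing to Filebeat and Metricbeat (and whether the localhost:9200 comes from the rendered policy or from some built-in default), I believe the full running configuration can be dumped on any of these hosts with:

sudo elastic-agent inspect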
This is the fleet.yml file on one of the elasticsearch nodes:
agent:
  id: f2f35a6f-8bbb-4a74-8d49-97424926516b
  headers: {}
  logging.level: info
  monitoring.http:
    enabled: false
    host: ""
    port: 6791
fleet:
  access_api_key: <key>
  agent:
    id: ""
  enabled: true
  host: <nlb dns name from AWS>:8220
  protocol: https
  proxy_disable: true
  reporting:
    check_frequency_sec: 30
    threshold: 10000
  server:
    host: 0.0.0.0
    internal_port: 8221
    output:
      elasticsearch:
        hosts:
          - localhost:9200
        protocol: https
        proxy_disable: false
        proxy_headers: null
        service_token: <token>
        ssl:
          certificate_authorities:
            - /etc/elasticsearch/certs/chain_cert.crt
          renegotiation: never
          verification_mode: ""
    policy:
      id: 499b5aa7-d214-5b5d-838b-3cd76469844e
    port: 8220
    ssl:
      certificate: /etc/elasticsearch/certs/<name>.crt
      key: /etc/elasticsearch/certs/<name>.key
      renegotiation: never
      verification_mode: ""
  ssl:
    certificate_authorities:
      - /etc/elasticsearch/certs/chain_cert.crt
    renegotiation: never
    verification_mode: ""
  timeout: 10m0s
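If I understand the layout correctly (and this is an assumption on my part), the localhost:9200 under fleet.server.output only applies to the Fleet Server process itself, and Filebeat/Metricbeat should instead receive the Elasticsearch hosts configured in Fleet Settings. So I'd expect the rendered policy on each agent to contain an output block along these lines, rather than the localhost default seen in the logs:

outputs:
  default:
    type: elasticsearch
    hosts:
      - https://<nlb dns name from AWS>:9200
    ssl:
      certificate_authorities:
        - /etc/elasticsearch/certs/chain_cert.crt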
And this is the fleet.yml from the kibana host:
agent:
  id: 80f20b0c-aa72-401a-a034-1bb4ca2400f7
  headers: {}
  logging.level: info
  monitoring.http:
    enabled: false
    host: ""
    port: 6791
fleet:
  access_api_key: <key>
  agent:
    id: ""
  enabled: true
  host: <nlb dns name from AWS>:8220
  hosts:
    - https://<nlb dns name from AWS>:8220
  protocol: http
  reporting:
    check_frequency_sec: 30
    threshold: 10000
  ssl:
    certificate_authorities:
      - /etc/kibana/certs/chain_cert.crt
    renegotiation: never
    verification_mode: none
  timeout: 10m0s
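One difference I notice here compared to the Elasticsearch nodes is protocol: http and verification_mode: none on this agent. If that turns out to be part of the problem, I assume the cleanest fix would be to re-enroll this agent against the NLB with the CA, something like (enrollment token redacted):

sudo elastic-agent enroll --url=https://<nlb dns name from AWS>:8220 --enrollment-token=<token> --certificate-authorities=/etc/kibana/certs/chain_cert.crt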
I believe the same problem has been posted here, though that person is working with a Windows host: