Hi @dkow,
Thanks for your help! Here are the answers to your questions. I have also run the diagnostics tool and sent the output to eck@elastic.co.
> How did you confirm that logs are missing? Is it a single log now and then or all logs during a particular time window?
We confirmed that logs are missing because we are transitioning from AWS managed Elasticsearch to ECK. When we moved our alerts over, we noticed that the alerts on the new system didn't pick up all of the events that the old system did. At first I thought the alerts weren't working, but when we looked further into it, the logs those alerts need in order to trigger were missing. The alerts do work when the logs are there. It doesn't appear to be a particular time window; random chunks of logs are missing.
> Can you identify a single Elastic Agent where you can confirm that some logs were dropped and identify when it occurred? If yes, then do the following:
>
> - exec into the Pod: `kubectl exec -it your_pod_name -- bash`
> - go to the log path for your namespace: `cd state/data/logs/default`
> - in that directory you should find logs for the Filebeat process that is responsible for gathering and sending the logs
>
> Can you see anything suspicious there? Connection failures, errors, etc.? If yes, can you share them here?

Yes — the Filebeat logs on that Agent contain warnings and errors like the two entries below:
{"log.level":"warn","@timestamp":"2021-10-03T21:03:10.474Z","log.logger":"elasticsearch","log.origin":{"file.name":"elasticsearch/client.go","file.line":405},"message":"Cannot index event publisher.Event{Content:beat.Event{Timestamp:time.Time{wall:0xc04ea483659f756d, ext:243062640032189, loc:(*time.Location)(0x5575af4ac320)}, Meta:{\"raw_index\":\"logs-generic-default\"}, Fields:{\"agent\":{\"ephemeral_id\":\"eba9f2cc-130f-42bd-af29-666265978c55\",\"hostname\":\"elastic-agent-agent-zvx4d\",\"id\":\"2f577b92-cf36-4fe8-944b-b17e0a473799\",\"name\":\"elastic-agent-agent-zvx4d\",\"type\":\"filebeat\",\"version\":\"7.14.1\"},\"cloud\":{\"account\":{\"id\":\"221837593202\"},\"availability_zone\":\"ap-southeast-2a\",\"image\":{\"id\":\"ami-04b1878ebf78f7370\"},\"instance\":{\"id\":\"i-0bad5d941b2efd7e6\"},\"machine\":{\"type\":\"m5a.large\"},\"provider\":\"aws\",\"region\":\"ap-southeast-2\",\"service\":{\"name\":\"EC2\"}},\"data_stream\":{\"dataset\":\"generic\",\"namespace\":\"default\",\"type\":\"logs\"},\"ecs\":{\"version\":\"1.10.0\"},\"elastic_agent\":{\"id\":\"2f577b92-cf36-4fe8-944b-b17e0a473799\",\"snapshot\":false,\"version\":\"7.14.1\"},\"event\":{\"dataset\":\"generic\"},\"host\":{\"architecture\":\"x86_64\",\"containerized\":true,\"hostname\":\"elastic-agent-agent-zvx4d\",\"id\":\"94ce1221fac96aa24ab7224bba617b9b\",\"ip\":[\"100.100.206.206\",\"fe80::649c:c9ff:febb:52a8\"],\"mac\":[\"66:9c:c9:bb:52:a8\"],\"name\":\"elastic-agent-agent-zvx4d\",\"os\":{\"codename\":\"Core\",\"family\":\"redhat\",\"kernel\":\"5.8.0-1041-aws\",\"name\":\"CentOS Linux\",\"platform\":\"centos\",\"type\":\"linux\",\"version\":\"7 (Core)\"}},\"input\":{\"type\":\"log\"},\"kubernetes\":{\"container\":{\"id\":\"b0c4e8a03d7f6c1b179db2a4f6d2eb45b5956137e45958182cd02a908d7da40d\",\"image\":\"k8s.gcr.io/ingress-nginx/controller:v0.47.0@sha256:a1e4efc107be0bb78f32eaec37bef17d7a0c81bec8066cdf2572508d21351d0b\",\"name\":\"controller\",\"runtime\":\"containerd\"},\"namespace\":\"production\",\"pod\":{\"ip\":\"100.100.206.232\",\"labels\":{\"app\":{\"kubernetes\":{\"io/component\":\"controller\",\"io/instance\":\"ingress-nginx-alb-production\",\"io/name\":\"ingress-nginx\"}},\"pod-template-hash\":\"5cd45d569d\"},\"name\":\"ingress-nginx-alb-production-controller-5cd45d569d-9w9rt\",\"uid\":\"26316f49-966b-4f1f-9307-627b64654a12\"}},\"log\":{\"file\":{\"path\":\"/var/log/containers/ingress-nginx-alb-production-controller-5cd45d569d-9w9rt_production_controller-b0c4e8a03d7f6c1b179db2a4f6d2eb45b5956137e45958182cd02a908d7da40d.log\"},\"offset\":6027608},\"message\":\"2021-10-03T21:03:08.273466192Z stderr F 2021/10/03 21:03:08 [error] 83#83: *3967636 upstream sent too big header while reading response header from upstream, client: 185.191.171.15, server: mp.natlib.govt.nz, request: \\\"GET /headings?il%5Bsubject%5D=Smoking\\u0026il%5Byear%5D=1882 HTTP/1.1\\\", upstream: \\\"http://100.100.206.212:3000/headings?il%5Bsubject%5D=Smoking\\u0026il%5Byear%5D=1882\\\", host: \\\"mp.natlib.govt.nz\\\"\"}, Private:file.State{Id:\"native::514952-66305\", PrevId:\"\", Finished:false, Fileinfo:(*os.fileStat)(0xc0041b4d00), Source:\"/var/log/containers/ingress-nginx-alb-production-controller-5cd45d569d-9w9rt_production_controller-b0c4e8a03d7f6c1b179db2a4f6d2eb45b5956137e45958182cd02a908d7da40d.log\", Offset:6028012, Timestamp:time.Time{wall:0xc04e9ef076086a3b, ext:237354915346089, loc:(*time.Location)(0x5575af4ac320)}, TTL:-1, Type:\"log\", Meta:map[string]string(nil), FileStateOS:file.StateOS{Inode:0x7db88, Device:0x10301}, 
IdentifierName:\"native\"}, TimeSeries:false}, Flags:0x1, Cache:publisher.EventCache{m:common.MapStr(nil)}} (status=400): {\"type\":\"mapper_parsing_exception\",\"reason\":\"failed to parse field [kubernetes.pod.labels.app] of type [keyword] in document with id 'ABD3R3wBvGqm6X9qvHl6'. Preview of field's value: '{kubernetes={io/instance=ingress-nginx-alb-production, io/component=controller, io/name=ingress-nginx}}'\",\"caused_by\":{\"type\":\"illegal_state_exception\",\"reason\":\"Can't get text on a START_OBJECT at 1:526\"}}","service.name":"filebeat","ecs.version":"1.6.0"}
{"log.level":"error","@timestamp":"2021-10-03T21:03:11.633Z","log.logger":"kubernetes","log.origin":{"file.name":"add_kubernetes_metadata/matchers.go","file.line":91},"message":"Error extracting container id - source value does not contain matcher's logs_path '/var/lib/docker/containers/'.","service.name":"filebeat","ecs.version":"1.6.0"}
> Can you confirm that there are no intermittent networking issues in your cluster? A couple of errors that you've posted look like there are some, e.g. `etcdserver: request timed out`, `error: fail to checkin to fleet-server`.
Unfortunately, we do have intermittent networking issues in the cluster that we are still trying to solve; they are mainly DNS-related.
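In case the context helps, this is roughly the kind of in-cluster DNS spot check we've been running while trying to pin it down (a minimal sketch; the busybox image and the names being resolved are illustrative, not our exact setup):

```sh
# throwaway pod with a working nslookup (busybox 1.28 is commonly used because newer busybox nslookup builds are unreliable)
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- sh

# inside the pod: resolve an in-cluster name and an external name a few times in a row
nslookup kubernetes.default.svc.cluster.local
nslookup artifacts.elastic.co
```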
> Kibana 401/500 errors are curious and I'm not sure why they only happen temporarily, but if all your Agents are ultimately healthy in Fleet UI it would seem that it's not the culprit here.
> Can you double check that Fleet Settings in Fleet UI (Fleet Server hosts and Elasticsearch hosts) are set correctly, to addresses routable from your Elastic Agents?
Yes, they are correct.
> Can you run our diagnostics tool and share the results? You can either share them here or send them to eck@elastic.co. It will not contain any Secrets, but it will have your ECK-managed resources and their logs, among other things.

Done — as mentioned above, I've sent the diagnostics output to eck@elastic.co.
Hope that helps!
Richard