Elastic Agent conditions-based autodiscover doesn't pick up newly-scheduled pods/containers

I am currently using Elastic Cloud, v8.6.1, with Elastic Agent Standalone v8.6.0 deployed to EKS, running Kubernetes v1.22.16 in our non-production cluster and v1.21.14 in our production cluster (to be updated this weekend to 1.22.14). I've also confirmed this issue with Elastic Agent v8.6.1 in both EKS clusters.

Since upgrading to Elastic Agent v8.6.x, I've started using condition-based autodiscovery for Kubernetes pods (I've sketched the provider setup after the list below). When pod logs rotate, they use a copyrotate strategy, and Elastic Agent continues ingesting the logs as expected. However, when a new pod starts, Elastic Agent does not appear to start monitoring it until the agent is restarted. This includes when:

  • A new deployment happens on a node for the first time (i.e., no prior instances of that deployment were running).
  • A pod restarts on the same node.
  • A pod stops on one node and a new instance starts on another node (i.e., when a pod "switches nodes").
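
Here is the provider setup those conditions rely on, roughly (a sketch assuming the reference-manifest defaults, not copied verbatim from my policy):

---
providers:
  kubernetes:
    node: '${NODE_NAME}'
    scope: 'node'
    # Hints are left disabled (the default); matching is done by the
    # 'condition' fields on the filestream streams shown further down.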

I have also tested this with filebeat-8.6.1 deployed to my cluster using the older-style filebeat.autodiscover.providers hints-based autodiscovery, and Filebeat does not appear to exhibit the same behavior.
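
For comparison, the Filebeat deployment used roughly the following hints-based autodiscover block (a sketch with the standard defaults, not the exact config):

---
filebeat.autodiscover:
  providers:
    - type: 'kubernetes'
      node: '${NODE_NAME}'
      hints.enabled: true
      hints.default_config:
        type: 'container'
        paths:
          - '/var/log/containers/*${data.kubernetes.container.id}.log'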

Here is a sample input configuration, taken from a pod for which this is particularly noticeable, since it generally terminates after about two hours and a new instance is scheduled to another node:

---
inputs:
  - id: 'filestream-myapp-0d9ea285-223f-4b6a-9207-9432b0eac168'
    name: 'filestream-myapp'
    type: 'filestream'
    use_output: 'default'
    data_stream:
      namespace: 'default'
    streams:
      # mycontainer
      - id: 'logs-${kubernetes.pod.name}-${kubernetes.container.id}'
        condition: '${kubernetes.container.name} == "mycontainer"'
        data_stream:
          dataset: 'myapp.log'
          type: 'logs'
        parsers:
          - container:
              stream: 'all'
              format: 'auto'
          - ndjson:
              expand_keys: true
              ignore_decoding_error: true
              overwrite_keys: true
              target: ''
        paths: ['/var/log/containers/*${kubernetes.container.id}.log']
        pipeline: 'logs-myapp.log'
        processors:
          - add_locale:
              format: 'offset'
        prospector.scanner.symlinks: true
        tags: ['myapp']

I do not currently have providers.kubernetes.hints.enabled: true set (the documentation for conditions-based autodiscovery does not require it), but I have tested it both ways with the same result.
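
For reference, the hints-enabled variant I tried looked roughly like this (sketch):

---
providers:
  kubernetes:
    node: '${NODE_NAME}'
    hints:
      enabled: true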

What appears to be happening is that Elastic Agent isn't watching for newly scheduled pods, so it never picks up their containers. It seems there are three possibilities:

  • There's a bug in Elastic Agent.
  • I have an issue in my Elastic Agent config.
  • I have the wrong permissions set in my ClusterRole (shown below, with a sketch of the accompanying ClusterRoleBinding after it):
    ---
    apiVersion: 'rbac.authorization.k8s.io/v1'
    kind: 'ClusterRole'
    metadata:
      name: 'elastic-agent'
    rules:
      - apiGroups: ['']
        resources:
          - 'configmaps'
          - 'events'
          - 'namespaces'
          - 'nodes'
          - 'pods'
          - 'services'
    
          # Needed for cloudbeat
          - 'persistentvolumeclaims'
          - 'persistentvolumes'
          - 'serviceaccounts'
    
        verbs: ['get', 'watch', 'list']
    
      - apiGroups: ['']
        resources: ['nodes/stats']
        verbs: ['get']
    
      - apiGroups: ['apps']
        resources:
          - 'daemonsets'
          - 'deployments'
          - 'replicasets'
          - 'statefulsets'
        verbs: ['get', 'watch', 'list']
    
      - apiGroups: ['batch']
        resources:
          - 'cronjobs'
          - 'jobs'
        verbs: ['get', 'watch', 'list']
    
      - apiGroups: ['coordination.k8s.io']
        resources: ['leases']
        verbs: ['get', 'create', 'update']
    
      - apiGroups: ['extensions']
        resources: ['replicasets']
        verbs: ['get', 'watch', 'list']
    
      - apiGroups: ['storage.k8s.io']
        resources:
          - 'storageclasses'
        verbs: ['get', 'watch', 'list']
    
      # Needed for cloudbeat
      - apiGroups: ['rbac.authorization.k8s.io']
        resources:
          - 'clusterrolebindings'
          - 'clusterroles'
          - 'rolebindings'
          - 'roles'
        verbs: ['get', 'watch', 'list']
    
      - apiGroups: ['policy']
        resources: ['podsecuritypolicies']
        verbs: ['get', 'watch', 'list']
    
      # Needed for apiserver
      - nonResourceURLs: ['/metrics']
        verbs: ['get']
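
For completeness, the role is bound to the agent's ServiceAccount with the usual ClusterRoleBinding, roughly as follows (a sketch; the names and namespace are the ones from the reference manifest and may differ in your deployment):

---
apiVersion: 'rbac.authorization.k8s.io/v1'
kind: 'ClusterRoleBinding'
metadata:
  name: 'elastic-agent'
subjects:
  - kind: 'ServiceAccount'
    name: 'elastic-agent'
    namespace: 'kube-system'
roleRef:
  kind: 'ClusterRole'
  name: 'elastic-agent'
  apiGroup: 'rbac.authorization.k8s.io'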
    

Hey DougR,

I noticed the same issue in our environment as well: Elastic-Agent autodiscovery for Kubernetes pod logs not working after upgrading to 8.6.x

So I think we both came across the same problem. Good to know it's nothing that just occurred in our environment. Our configs look quite similar.

In my case, the same agent policy works fine with Elastic Agent 8.5.3, so I'm currently thinking it's a bug in the new version. It would be nice if somebody could confirm that it's a bug and not a config error.


Looks like we found the issue at about the same time. I'm going to try to downgrade to 8.5.3 again tomorrow and see what happens (I may not have given it enough time).
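
For anyone following along, the downgrade is just a matter of pinning the image tag on the agent DaemonSet, roughly like this (a sketch; container name assumed from the reference manifest):

---
# In the elastic-agent DaemonSet pod spec:
containers:
  - name: 'elastic-agent'
    image: 'docker.elastic.co/beats/elastic-agent:8.5.3'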

Either way, it doesn't appear that there's a bug open on this yet.

After downgrading to 8.5.3, I let it run overnight, and everything is ingesting as expected. It doesn't look as if any bugs are open on this, so I've submitted a bug report:

Elastic Agent 8.6.x standalone deployment in Kubernetes doesn't start monitoring new pods until agent restart · Issue #2269 · elastic/elastic-agent · GitHub


Thanks!

I've observed a new pod instance spinning up on a different node for the input I posted above, and logs are being ingested, as expected.
