Filebeat reporting stops after index errors

I've noticed that Filebeat stops sending logs to ES after a while (it seems to coincide with some pods restarting - possibly the Jaeger agent, since the cleanup jobs start around midnight). The failure is always caused by an empty index name due to a variable replacement failure, but I can't figure out why. I've kept adding filters (to avoid picking up logs from pods that don't have the metadata I'm using) and a drop_event processor to discard events where no Kubernetes metadata is available.

The errors look like this:

2020-11-16T09:16:02.819Z	DEBUG	[elasticsearch]	elasticsearch/client.go:390	Bulk item insert failed (i=0, status=500): {"type":"string_index_out_of_bounds_exception","reason":"String index out of range: 0"}
2020-11-16T09:16:02.819Z	DEBUG	[elasticsearch]	elasticsearch/client.go:390	Bulk item insert failed (i=1, status=500): {"type":"string_index_out_of_bounds_exception","reason":"String index out of range: 0"}
Non-zero metrics in the last 30s	{"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":249200,"time":{"ms":16}},"total":{"ticks":3447000,"time":{"ms":688},"value":3447000},"user":{"ticks":3197800,"time":{"ms":672}}},"handles":{"limit":{"hard":65536,"soft":65536},"open":10},"info":{"ephemeral_id":"1e130f4a-c465-4398-844a-8437c48be4ab","uptime":{"ms":238260052}},"memstats":{"gc_next":32620960,"memory_alloc":21225256,"memory_total":366173647424},"runtime":{"goroutines":144}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"batches":4,"failed":8,"total":8}},"pipeline":{"clients":13,"events":{"active":2389,"retry":18}}},"registrar":{"states":{"current":68}},"system":{"load":{"1":5.4,"15":7.36,"5":7.24,"norm":{"1":2.7,"15":3.68,"5":3.62}}}}}}

Here's my config:

filebeat.yml: |-
    setup:
      ilm.enabled: false
      template:
        name: "mylogs"
        pattern: "mylogs-*"
        overwrite: true

    filebeat.autodiscover:
      providers:
        - type: kubernetes
          node: ${NODE_NAME}
          templates:
            - condition.and:
                - equals:
                    kubernetes.namespace: sandbox
                - not.contains:
                    kubernetes.pod.name: jaeger-collector
                - not.contains:
                    kubernetes.pod.name: jaeger-spark
              config:
                - type: container
                  tail_files: true
                  symlinks: true
                  paths:
                    - /var/log/containers/*-${data.kubernetes.container.id}.log

    processors:
      - add_cloud_metadata:
      - add_host_metadata:
      
      - drop_event:
          when:
            not:
              regexp:
                kubernetes.namespace: ".*"

      - decode_json_fields:
          fields: ["message"]
          max_depth: 8
          target: ""
          overwrite_keys: true

    output.elasticsearch:  
      hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
      index: "kytelogs-%{[kubernetes.labels.app]}-%{[kubernetes.labels.domain]}-%{[agent.version]}-%{+yyyy.MM.dd}" 

This used to fail every night. After adding the drop_event processor it lasted through the weekend, but then it started failing again.
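
The drop_event processor above only checks kubernetes.namespace, while the index name also depends on kubernetes.labels.app and kubernetes.labels.domain, so I'm considering tightening it to require those fields too. Something like this untested sketch (if I understand has_fields correctly, it is true only when all listed fields exist, so wrapping it in not should drop any event missing one of them):

      - drop_event:
          when:
            not:
              # drop the event if any of the fields used in the index name is missing
              has_fields:
                - kubernetes.namespace
                - kubernetes.labels.app
                - kubernetes.labels.domain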

A quick fix is to just delete the Filebeat pod, but that's a lot of manual work and it still results in lost logs. I would appreciate any suggestions on how to debug this further.
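
One other idea I've been considering but haven't tried yet: instead of a single index format string, use the indices setting with a has_fields condition, so events that carry the labels go to the usual index and everything else falls back to a static index rather than producing an empty index name. As far as I understand the docs, the top-level index setting is used when no indices rule matches. The kytelogs-unmatched name below is just a placeholder I made up:

    output.elasticsearch:
      hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
      indices:
        # used only when both labels are present on the event
        - index: "kytelogs-%{[kubernetes.labels.app]}-%{[kubernetes.labels.domain]}-%{[agent.version]}-%{+yyyy.MM.dd}"
          when.has_fields: ['kubernetes.labels.app', 'kubernetes.labels.domain']
      # fallback when no rule above matches (placeholder name)
      index: "kytelogs-unmatched-%{[agent.version]}-%{+yyyy.MM.dd}"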
