I've noticed that Filebeat stops sending logs to ES after a while (it seems some pods are restarting; it might be the Jaeger agent, since the cleanup jobs start around midnight). It's always caused by an empty index name due to a variable replacement failure, but I can't figure out why. I've kept adding filters (to avoid picking up logs from pods that don't have the metadata I'm using) and a drop_event processor to discard events where no Kubernetes metadata is available.
The errors look like this:
2020-11-16T09:16:02.819Z DEBUG [elasticsearch] elasticsearch/client.go:390 Bulk item insert failed (i=0, status=500): {"type":"string_index_out_of_bounds_exception","reason":"String index out of range: 0"}
2020-11-16T09:16:02.819Z DEBUG [elasticsearch] elasticsearch/client.go:390 Bulk item insert failed (i=1, status=500): {"type":"string_index_out_of_bounds_exception","reason":"String index out of range: 0"}
Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":249200,"time":{"ms":16}},"total":{"ticks":3447000,"time":{"ms":688},"value":3447000},"user":{"ticks":3197800,"time":{"ms":672}}},"handles":{"limit":{"hard":65536,"soft":65536},"open":10},"info":{"ephemeral_id":"1e130f4a-c465-4398-844a-8437c48be4ab","uptime":{"ms":238260052}},"memstats":{"gc_next":32620960,"memory_alloc":21225256,"memory_total":366173647424},"runtime":{"goroutines":144}},"filebeat":{"harvester":{"open_files":0,"running":0}},"libbeat":{"config":{"module":{"running":0}},"output":{"events":{"batches":4,"failed":8,"total":8}},"pipeline":{"clients":13,"events":{"active":2389,"retry":18}}},"registrar":{"states":{"current":68}},"system":{"load":{"1":5.4,"15":7.36,"5":7.24,"norm":{"1":2.7,"15":3.68,"5":3.62}}}}}}
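That output is with debug logging already enabled for the Elasticsearch output; the only other idea I've had for getting more visibility is widening it to all selectors, to try to catch the full event that ends up with the empty index name. I'm not sure which selector actually logs the published event contents, so this just turns everything on (standard Filebeat logging settings, only meant as a temporary measure given the volume):

    logging.level: debug
    logging.selectors: ["*"]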
Here's my config:
filebeat.yml: |-
  setup:
    ilm.enabled: false
    template:
      name: "mylogs"
      pattern: "mylogs-*"
      overwrite: true
  filebeat.autodiscover:
    providers:
      - type: kubernetes
        node: ${NODE_NAME}
        templates:
          - condition.and:
              - equals:
                  kubernetes.namespace: sandbox
              - not.contains:
                  kubernetes.pod.name: jaeger-collector
              - not.contains:
                  kubernetes.pod.name: jaeger-spark
            config:
              - type: container
                tail_files: true
                symlinks: true
                paths:
                  - /var/log/containers/*-${data.kubernetes.container.id}.log
  processors:
    - add_cloud_metadata:
    - add_host_metadata:
    - drop_event:
        when:
          not:
            regexp:
              kubernetes.namespace: ".*"
    - decode_json_fields:
        fields: ["message"]
        max_depth: 8
        target: ""
        overwrite_keys: true
  output.elasticsearch:
    hosts: ['${ELASTICSEARCH_HOST:elasticsearch}:${ELASTICSEARCH_PORT:9200}']
    index: "kytelogs-%{[kubernetes.labels.app]}-%{[kubernetes.labels.domain]}-%{[agent.version]}-%{+yyyy.MM.dd}"
This used to fail every night. After adding the drop_event processor, it survived the weekend, but then it started failing again.
A quick fix is to just delete the filebeat pod, but it's a lot of manual work and it still results in lost logs. I would appreciate any suggestions on how to debug this further.
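The other thing I'm considering is tightening the drop_event condition so it checks for the exact fields the index name needs, instead of only checking that kubernetes.namespace is set, roughly like this (has_fields is the standard Beats condition, and the two label fields are just the ones my index pattern uses):

    processors:
      - drop_event:
          when:
            not:
              has_fields: ["kubernetes.labels.app", "kubernetes.labels.domain"]

But since I still don't know why the metadata is missing on those events in the first place, I'm not sure either of these would fix the root cause rather than just hide it.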