Filebeat: "memory leak" via filebeat.autodiscover and >200,000 goroutines

Hi,

We use Filebeat (8.5.3) to scrape logs from Kubernetes (1.24) nodes and send them to Logstash (8.5.3). We have been doing this since Filebeat 6.x and have adjusted the configs to the best of our knowledge during the migrations.

Since the update from 7.17 to 8.5.x, we see steadily increasing memory usage in Filebeat until it hits the memory resource limit and gets restarted by k8s. We've allowed up to 6 GB, which should be more than enough, but that gets used up on a regular basis.

We see this behaviour on all four k8s clusters we run; they all use the same Filebeat config:

          filebeat.inputs:
            - type: filestream
              enabled: true
              paths:
                - "/var/log/messages"
              id: "ec2-messages-log"
              tags: ["ec2-messages-log"]
          filebeat.autodiscover:
            providers:
              - type: kubernetes
                scope: node
                templates:
                  - condition:
                      equals:
                        kubernetes.container.name: example-app
                    config:
                      - type: container
                        paths:
                          - /var/log/containers/*-${data.kubernetes.container.id}.log
                        processors:
                        - add_kubernetes_metadata:
                            host: ${NODE_NAME}
                            matchers:
                            - logs_path:
                                logs_path: "/var/log/containers/"
                        parsers:
                           - multiline:
                               type: pattern
                               pattern: '^\[(\d{1,2}\/\d{1,2}\/\d{2,4})|(\w{3} \w{3} \d{2})|(\d{2}\/\w{3}\/\d{2,4})'
                               negate: true
                               match: after
                  - condition:
                      not:
                        equals:
                          kubernetes.container.name: example-app
                    config:
                      - type: container
                        paths:
                          - /var/log/containers/*-${data.kubernetes.container.id}.log
                        processors:
                        - add_kubernetes_metadata:
                            host: ${NODE_NAME}
                            matchers:
                            - logs_path:
                                logs_path: "/var/log/containers/"
          output.logstash:
            hosts: ["logstash-logstash:5044"]

By deactivating parts of that config step by step, my guess is that the filebeat.autodiscover / kubernetes provider is the likely source of the memory leak. Disabling its processors / parsers did not mitigate the issue.

The path pattern for the container logs should be fine and not match too widely; it follows what the documentation advises. Some older posts about Filebeat memory usage suggested that the k8s nodes might have too many small logfiles open, or rotate them too often.
Checking the log paths on the nodes, there are only ~30 logfiles present at any given time.

Here are the JSON-formatted metrics logged by one Filebeat pod shortly before it hits the memory limit:

{
    "log.level": "info",
    "@timestamp": "2022-12-28T07:57:34.547Z",
    "log.logger": "monitoring",
    "log.origin": {
        "file.name": "log/log.go",
        "file.line": 186
    },
    "message": "Non-zero metrics in the last 30s",
    "service.name": "filebeat",
    "monitoring": {
        "metrics": {
            "beat": {
                "cgroup": {
                    "cpu": {
                        "stats": {
                            "periods": 299,
                            "throttled": {
                                "ns": 6014494525,
                                "periods": 36
                            }
                        }
                    },
                    "cpuacct": {
                        "total": {
                            "ns": 10627530809
                        }
                    },
                    "memory": {
                        "mem": {
                            "usage": {
                                "bytes": 5544570880
                            }
                        }
                    }
                },
                "cpu": {
                    "system": {
                        "ticks": 1703480,
                        "time": {
                            "ms": 860
                        }
                    },
                    "total": {
                        "ticks": 10968490,
                        "time": {
                            "ms": 9990
                        },
                        "value": 10968490
                    },
                    "user": {
                        "ticks": 9265010,
                        "time": {
                            "ms": 9130
                        }
                    }
                },
                "handles": {
                    "limit": {
                        "hard": 1048576,
                        "soft": 1048576
                    },
                    "open": 116
                },
                "info": {
                    "ephemeral_id": "dd93b8f2-8632-4c22-ab15-205dcb332ff4",
                    "uptime": {
                        "ms": 80580095
                    },
                    "version": "8.5.3"
                },
                "memstats": {
                    "gc_next": 6229816968,
                    "memory_alloc": 2781633416,
                    "memory_total": 678799366512,
                    "rss": 5580726272
                },
                "runtime": {
                    "goroutines": 223821
                }
            },
            "filebeat": {
                "events": {
                    "active": 1,
                    "added": 159,
                    "done": 158
                },
                "harvester": {
                    "open_files": 14,
                    "running": 14
                }
            },
            "libbeat": {
                "config": {
                    "module": {
                        "running": 37
                    }
                },
                "output": {
                    "events": {
                        "acked": 159,
                        "active": 0,
                        "batches": 94,
                        "total": 159
                    },
                    "read": {
                        "bytes": 564
                    },
                    "write": {
                        "bytes": 142303
                    }
                },
                "pipeline": {
                    "clients": 38,
                    "events": {
                        "active": 0,
                        "published": 159,
                        "total": 159
                    },
                    "queue": {
                        "acked": 159
                    }
                }
            },
            "registrar": {
                "states": {
                    "current": 51,
                    "update": 155
                },
                "writes": {
                    "success": 20,
                    "total": 20
                }
            },
            "system": {
                "load": {
                    "1": 0.7,
                    "15": 1.12,
                    "5": 0.99,
                    "norm": {
                        "1": 0.175,
                        "15": 0.28,
                        "5": 0.2475
                    }
                }
            }
        },
        "ecs.version": "1.6.0"
    }
}

The overall number of open handles looks okay-ish to me. The number of goroutines (runtime.goroutines: 223821), however, seems way too high.

Is there any obvious issue with the config that I've missed? Filebeat had memory issues with that config under 7.x for us before, but they vanished after some restarts. With 8.x, we see them consistently in all k8s clusters, after every restart.

Referring to some older posts, I've also tried adding

queue.mem:
  events: 4096
  flush.min_events: 16
  flush.timeout: 0s

to the config, but the issue remains the same: a rising number of goroutines until the pod gets OOM-killed.

Any advice on how to further debug the issue?

Thanks in advance!

Best regards.

Hey @HarryTuttle!

You might be hitting a known issue inside the autodiscovery core, as you correctly assume.

Please have a look at [Metricbeat] Possible memory leak with autodiscover · Issue #33307 · elastic/beats · GitHub and try to disable the deployment/cronjob metadata by adding the following:

add_resource_metadata:
  cronjob: false
  deployment: false

This should go at the same level as scope, templates, etc.
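In context, within the provider block it would look roughly like this (trimmed to the relevant keys of your config):

filebeat.autodiscover:
  providers:
    - type: kubernetes
      scope: node
      # disable the deployment/cronjob metadata enrichment
      add_resource_metadata:
        cronjob: false
        deployment: false
      templates:
        # ... your existing templates ...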

Also the respective docs in case you need more info: Autodiscover | Filebeat Reference [8.5] | Elastic (look for add_resource_metadata).

If none of this helps, then you will need to take some memory heap profiles using pprof, so we can see which code paths are leaving the leaked memory footprints (if any).
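For example, enabling the monitoring HTTP endpoint together with the Go pprof handlers in the Filebeat config should be enough (the endpoint defaults to localhost:5066, so a heap profile can then be pulled from /debug/pprof/heap):

# expose the local monitoring endpoint and the pprof handlers
http.enabled: true
http.pprof.enabled: true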

C.

Hi @ChrsMark,

thanks for your feedback.

I've configured the add_resource_metadata as advised:

# logging.level: warning
# logging.metrics.enabled: false
http.enabled: true
http.pprof.enabled: true
queue.mem:
  events: 4096
  flush.min_events: 16
  flush.timeout: 0s
filebeat.inputs:
  - type: filestream
    enabled: true
    paths:
      - "/var/log/messages"
    id: "ec2-messages-log"
    tags: ["ec2-messages-log"]
filebeat.autodiscover:
  providers:
    - type: kubernetes
      add_resource_metadata:
          # to avoid memory leak
          # https://github.com/elastic/beats/issues/33307#issuecomment-1285647584
          cronjob: false
          deployment: false
      scope: node
      templates:
        - condition:
            equals:
              kubernetes.container.name: <name>
          config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
              processors:
              - add_kubernetes_metadata:
                  host: ${NODE_NAME}
                  matchers:
                  - logs_path:
                      logs_path: "/var/log/containers/"
              parsers:
                 - multiline:
                     type: pattern
                     pattern: '^\[(\d{1,2}\/\d{1,2}\/\d{2,4})|(\w{3} \w{3} \d{2})|(\d{2}\/\w{3}\/\d{2,4})'
                     negate: true
                     match: after
        - condition:
            not:
              equals:
                kubernetes.container.name: <name>
          config:
            - type: container
              paths:
                - /var/log/containers/*-${data.kubernetes.container.id}.log
              processors:
              - add_kubernetes_metadata:
                  host: ${NODE_NAME}
                  matchers:
                  - logs_path:
                      logs_path: "/var/log/containers/"
output.logstash:
  hosts: ["logstash-logstash:5044"]

However, there is still what seems to be a memory leak.

I've taken pprof heap profiles for ~10 hrs, every 5 minutes. I'll attach only five samples here for now. Let me know if you need more.

(Heap profile screenshots attached for samples 1, 25, 50, 75, and 100.)
Here's the JSON monitoring log from the point in time when the last heap profile was captured:

{
    "log.level": "info",
    "@timestamp": "2023-01-02T18:35:19.360Z",
    "log.logger": "monitoring",
    "log.origin": {
        "file.name": "log/log.go",
        "file.line": 186
    },
    "message": "Non-zero metrics in the last 30s",
    "service.name": "filebeat",
    "monitoring": {
        "metrics": {
            "beat": {
                "cgroup": {
                    "cpu": {
                        "stats": {
                            "periods": 293,
                            "throttled": {
                                "ns": 1716753194,
                                "periods": 10
                            }
                        }
                    },
                    "cpuacct": {
                        "total": {
                            "ns": 4973747731
                        }
                    },
                    "memory": {
                        "mem": {
                            "usage": {
                                "bytes": 2514362368
                            }
                        }
                    }
                },
                "cpu": {
                    "system": {
                        "ticks": 291860,
                        "time": {
                            "ms": 420
                        }
                    },
                    "total": {
                        "ticks": 1675910,
                        "time": {
                            "ms": 3520
                        },
                        "value": 1675910
                    },
                    "user": {
                        "ticks": 1384050,
                        "time": {
                            "ms": 3100
                        }
                    }
                },
                "handles": {
                    "limit": {
                        "hard": 1048576,
                        "soft": 1048576
                    },
                    "open": 51
                },
                "info": {
                    "ephemeral_id": "2399f1e1-adbc-40e0-8e27-9fd41d7d0edb",
                    "uptime": {
                        "ms": 30600071
                    },
                    "version": "8.5.3"
                },
                "memstats": {
                    "gc_next": 2223539928,
                    "memory_alloc": 1029467712,
                    "memory_sys": 8388608,
                    "memory_total": 136369878048,
                    "rss": 2472394752
                },
                "runtime": {
                    "goroutines": 76468
                }
            },
            "filebeat": {
                "events": {
                    "active": -3,
                    "added": 68,
                    "done": 71
                },
                "harvester": {
                    "open_files": 8,
                    "running": 8
                }
            },
            "libbeat": {
                "config": {
                    "module": {
                        "running": 31
                    }
                },
                "output": {
                    "events": {
                        "acked": 68,
                        "active": 0,
                        "batches": 52,
                        "total": 68
                    },
                    "read": {
                        "bytes": 312
                    },
                    "write": {
                        "bytes": 72870
                    }
                },
                "pipeline": {
                    "clients": 32,
                    "events": {
                        "active": 0,
                        "published": 68,
                        "total": 68
                    },
                    "queue": {
                        "acked": 68
                    }
                }
            },
            "registrar": {
                "states": {
                    "current": 65,
                    "update": 65
                },
                "writes": {
                    "success": 18,
                    "total": 18
                }
            },
            "system": {
                "load": {
                    "1": 1.62,
                    "15": 0.94,
                    "5": 1.08,
                    "norm": {
                        "1": 0.405,
                        "15": 0.235,
                        "5": 0.27
                    }
                }
            }
        },
        "ecs.version": "1.6.0"
    }
}

I can also run longer sessions with a different sampling frequency, if necessary. The pods get OOM-killed (6 GB limit) after about 22 hrs, so that is the longest we can offer.

Thanks in advance.

Best regards

Thanks! The memory heaps look familiar indeed.
Can you also remove the add_kubernetes_metadata processor?

The kubernetes provider adds the metadata by default, so you don't need the extra processor for this.
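For example, the first template's config could probably be reduced to something like this (the second template the same, just without the multiline parser):

- type: container
  paths:
    - /var/log/containers/*-${data.kubernetes.container.id}.log
  # no add_kubernetes_metadata processor here; the autodiscover
  # provider already enriches events with kubernetes.* metadata
  parsers:
    - multiline:
        type: pattern
        pattern: '^\[(\d{1,2}\/\d{1,2}\/\d{2,4})|(\w{3} \w{3} \d{2})|(\d{2}\/\w{3}\/\d{2,4})'
        negate: true
        match: after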

C.

Hi @ChrsMark,

removing this processor as well seems to have resolved the issue.
Thank you very much for your kind help, it was invaluable!

Best regards

Nice @HarryTuttle, good to see the problem is solved for you!

For reference, this issue should be properly tackled at `add_resource_metadata.cronjob` overloads the memory usage · Issue #31 · elastic/elastic-agent-autodiscover · GitHub.
