Memory leak on Kubernetes nodes

Hi, I'm seeing a memory issue when running Auditbeat on Kubernetes nodes.
I have tried Auditbeat 8.12 and 8.13.
Here is the config I'm running:

auditbeat.modules:

- module: auditd
  processors:
    - add_session_metadata:
        backend: "auto"
    - add_docker_metadata:
    - drop_event:
        when:
          or:
            - has_fields: ['container']
            - contains:
                process.entry_leader.entry_meta.type: "container"
            - contains:
                process.entry_leader.args: "containerd"

  audit_rules: |
    -a exit,always -F arch=b64 -F euid=0 -S execve -k rootact
    -a exit,always -F arch=b32 -F euid=0 -S execve -k rootact
    -a always,exit -F arch=b64 -S connect -F a2=16 -F success=1 -F key=network_connect_4
    -a always,exit -F arch=b64 -F exe=/bin/bash -F success=1 -S connect -k "remote_shell"
    -a always,exit -F arch=b64 -F exe=/usr/bin/bash -F success=1 -S connect -k "remote_shell"

Auditbeat eats RAM over time until Linux OOM-kills it, and the process then requires a restart.

The sharp drops in memory consumption on the graph correspond to restarts of the Auditbeat service or of the server.

Thanks in advance.

Hi,

I think the large memory usage is likely being caused by the add_session_metadata processor. It tracks all process forks and executions in memory while it is in use. The memory usage should be bounded, but it can grow quite large.

There was a large architectural redesign of the add_session_metadata processor in 8.16 that may help with this. Could you upgrade to Auditbeat 8.16 or higher? Or if you're not using the data provided by this processor, you could remove it from the config.
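
If you do go the removal route, a minimal sketch of the module relying only on add_docker_metadata for the container filtering could look like the one below; the process.entry_leader.* conditions are dropped because those fields are populated by add_session_metadata. Whether the PID-based Docker metadata lookup catches all the container events you currently filter out is an assumption you would need to verify on your container runtime:

    auditbeat.modules:
      - module: auditd
        processors:
          # add_docker_metadata matches events by PID against the container
          # runtime and, on a hit, adds a `container` field to the event
          - add_docker_metadata:
          - drop_event:
              when:
                has_fields: ['container']
        # audit_rules stay exactly as before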

Yes, I also believe it is caused by the add_session_metadata processor.
We tried 8.16 and 8.17; it didn't help.

We use it to exclude events from inside containers from the audit. Can you recommend another way to ensure that only events caused by processes of the k8s node itself are audited, excluding containers?

Can you confirm it is indeed the add_session_metadata processor?
Running one instance without the processor for a while should be enough to see whether the pattern changes.

After that, can you get a memory profile? You will need to set http.pprof.enabled; see Configure an HTTP endpoint for metrics | Auditbeat Reference [8.17] | Elastic.
Just be sure to take the dump while the memory usage is high enough.
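
For reference, a minimal sketch of the relevant settings from that docs page, using what I believe are the default host and port (adjust to your environment):

    # expose the local monitoring endpoint and the Go pprof profiles
    http.enabled: true
    http.host: localhost
    http.port: 5066
    http.pprof.enabled: true

With that in place, the heap profile should be reachable at http://localhost:5066/debug/pprof/heap (for example via go tool pprof); grab it once the RSS is already high.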

Yes, I disabled the add_session_metadata processor and the memory leak stopped.

@dimaz can you try adding -a always,exit -F arch=b64 -S exit_group to your audit rules?

@Michael_Wolf It looks like if we don't get any exit_group events, the processor DB instance won't reap any dead processes?

I added -a always,exit -F arch=b64 -S exit_group and it had no effect on the memory leak.
Here is the memory profile I got by setting http.pprof.enabled.

master node:

{
  "auditd": {
    "kernel_lost": 31435,
    "reassembler_seq_gaps": 884763805174,
    "received_msgs": 1492935,
    "userspace_lost": 0
  },
  "beat": {
    "cgroup": {
      "cpu": {
        "cfs": {
          "period": {
            "us": 100000
          },
          "quota": {
            "us": 0
          }
        },
        "id": "auditbeat.service",
        "stats": {
          "periods": 0,
          "throttled": {
            "ns": 0,
            "periods": 0
          }
        }
      },
      "cpuacct": {
        "id": "auditbeat.service",
        "total": {
          "ns": 341370860453
        }
      },
      "memory": {
        "id": "auditbeat.service",
        "mem": {
          "limit": {
            "bytes": 9223372036854771712
          },
          "usage": {
            "bytes": 1127354368
          }
        }
      }
    },
    "cpu": {
      "system": {
        "ticks": 46590,
        "time": {
          "ms": 46590
        }
      },
      "total": {
        "ticks": 341360,
        "time": {
          "ms": 341360
        },
        "value": 341360
      },
      "user": {
        "ticks": 294770,
        "time": {
          "ms": 294770
        }
      }
    },
    "handles": {
      "limit": {
        "hard": 262144,
        "soft": 262144
      },
      "open": 26
    },
    "info": {
      "ephemeral_id": "13f39bfc-d91f-43f0-9a95-21636139775d",
      "name": "auditbeat",
      "uptime": {
        "ms": 190182143
      },
      "version": "8.17.0"
    },
    "memstats": {
      "gc_next": 40572544,
      "memory_alloc": 26293464,
      "memory_sys": 79088195,
      "memory_total": 56586289080,
      "rss": 1112170496
    },
    "runtime": {
      "goroutines": 27
    }
  },
  "libbeat": {
    "config": {
      "module": {
        "running": 0,
        "starts": 0,
        "stops": 0
      },
      "reloads": 0,
      "scans": 0
    },
    "output": {
      "batches": {
        "split": 0
      },
      "events": {
        "acked": 316014,
        "active": 0,
        "batches": 30604,
        "dead_letter": 0,
        "dropped": 0,
        "duplicates": 0,
        "failed": 23,
        "toomany": 0,
        "total": 316037
      },
      "read": {
        "bytes": 1648146,
        "errors": 1
      },
      "type": "logstash",
      "write": {
        "bytes": 126110743,
        "errors": 1,
        "latency": {
          "histogram": {
            "count": 0,
            "max": 0,
            "mean": 0,
            "median": 0,
            "min": 0,
            "p75": 0,
            "p95": 0,
            "p99": 0,
            "p999": 0,
            "stddev": 0
          }
        }
      }
    },
    "pipeline": {
      "clients": 1,
      "events": {
        "active": 2,
        "dropped": 0,
        "failed": 0,
        "filtered": 0,
        "published": 316015,
        "retry": 53,
        "total": 316016
      },
      "queue": {
        "acked": 316014,
        "added": {
          "bytes": 0,
          "events": 316015
        },
        "consumed": {
          "bytes": 0,
          "events": 316014
        },
        "filled": {
          "bytes": 0,
          "events": 1,
          "pct": 0.00048828125
        },
        "max_bytes": 0,
        "max_events": 2048,
        "removed": {
          "bytes": 0,
          "events": 316014
        }
      }
    }
  },
  "metricbeat": {
    "auditd": {
      "auditd": {
        "consecutive_failures": 0,
        "events": 316017,
        "failures": 116,
        "success": 315903
      }
    }
  },
  "system": {
    "cpu": {
      "cores": 2
    },
    "load": {
      "1": 0.34,
      "15": 0.28,
      "5": 0.27,
      "norm": {
        "1": 0.17,
        "15": 0.14,
        "5": 0.135
      }
    }
  }
}

worker node:

{
  "auditd": {
    "kernel_lost": 22594,
    "reassembler_seq_gaps": 5793911785175,
    "received_msgs": 5855068,
    "userspace_lost": 0
  },
  "beat": {
    "cgroup": {
      "cpu": {
        "cfs": {
          "period": {
            "us": 100000
          },
          "quota": {
            "us": 0
          }
        },
        "id": "auditbeat.service",
        "stats": {
          "periods": 0,
          "throttled": {
            "ns": 0,
            "periods": 0
          }
        }
      },
      "cpuacct": {
        "id": "auditbeat.service",
        "total": {
          "ns": 1031281329977
        }
      },
      "memory": {
        "id": "auditbeat.service",
        "mem": {
          "limit": {
            "bytes": 9223372036854771712
          },
          "usage": {
            "bytes": 5337092096
          }
        }
      }
    },
    "cpu": {
      "system": {
        "ticks": 139370,
        "time": {
          "ms": 139370
        }
      },
      "total": {
        "ticks": 1031270,
        "time": {
          "ms": 1031270
        },
        "value": 1031270
      },
      "user": {
        "ticks": 891900,
        "time": {
          "ms": 891900
        }
      }
    },
    "handles": {
      "limit": {
        "hard": 262144,
        "soft": 262144
      },
      "open": 58
    },
    "info": {
      "ephemeral_id": "057fbe51-89e4-4f37-8bff-6663e94dd299",
      "name": "auditbeat",
      "uptime": {
        "ms": 318090029
      },
      "version": "8.17.0"
    },
    "memstats": {
      "gc_next": 48628040,
      "memory_alloc": 32350416,
      "memory_sys": 127224387,
      "memory_total": 136483867096,
      "rss": 5384351744
    },
    "runtime": {
      "goroutines": 29
    }
  },
  "libbeat": {
    "config": {
      "module": {
        "running": 0,
        "starts": 0,
        "stops": 0
      },
      "reloads": 0,
      "scans": 0
    },
    "output": {
      "batches": {
        "split": 0
      },
      "events": {
        "acked": 852111,
        "active": 0,
        "batches": 54461,
        "dead_letter": 0,
        "dropped": 0,
        "duplicates": 0,
        "failed": 97,
        "toomany": 0,
        "total": 852208
      },
      "read": {
        "bytes": 2701408,
        "errors": 2
      },
      "type": "logstash",
      "write": {
        "bytes": 382956376,
        "errors": 3,
        "latency": {
          "histogram": {
            "count": 0,
            "max": 0,
            "mean": 0,
            "median": 0,
            "min": 0,
            "p75": 0,
            "p95": 0,
            "p99": 0,
            "p999": 0,
            "stddev": 0
          }
        }
      }
    },
    "pipeline": {
      "clients": 1,
      "events": {
        "active": 5,
        "dropped": 0,
        "failed": 0,
        "filtered": 24376,
        "published": 852115,
        "retry": 258,
        "total": 876492
      },
      "queue": {
        "acked": 852111,
        "added": {
          "bytes": 0,
          "events": 852115
        },
        "consumed": {
          "bytes": 0,
          "events": 852111
        },
        "filled": {
          "bytes": 0,
          "events": 4,
          "pct": 0.001953125
        },
        "max_bytes": 0,
        "max_events": 2048,
        "removed": {
          "bytes": 0,
          "events": 852111
        }
      }
    }
  },
  "metricbeat": {
    "auditd": {
      "auditd": {
        "consecutive_failures": 0,
        "events": 876493,
        "failures": 959,
        "success": 875538
      }
    }
  },
  "system": {
    "cpu": {
      "cores": 10
    },
    "load": {
      "1": 0.1,
      "15": 0.12,
      "5": 0.1,
      "norm": {
        "1": 0.01,
        "15": 0.012,
        "5": 0.01
      }
    }
  }
}

Alright, I can't reproduce the full leak, but I suspect there's some kind of system-dependent behavior going on here.

@dimaz

The docs mention setting a number of audit rules specifically for this processor. Can you make sure they're all set?

    ## executions
    -a always,exit -F arch=b64 -S execve,execveat -k exec
    -a always,exit -F arch=b64 -S exit_group
    ## set_sid
    -a always,exit -F arch=b64 -S setsid
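
Merged into the audit_rules block you posted, that could look roughly like the sketch below. Note that the execve,execveat rule from the docs is not limited to euid=0, unlike your existing exec rules, so the event volume may go up:

    audit_rules: |
      ## existing rules
      -a exit,always -F arch=b64 -F euid=0 -S execve -k rootact
      -a exit,always -F arch=b32 -F euid=0 -S execve -k rootact
      -a always,exit -F arch=b64 -S connect -F a2=16 -F success=1 -F key=network_connect_4
      -a always,exit -F arch=b64 -F exe=/bin/bash -F success=1 -S connect -k "remote_shell"
      -a always,exit -F arch=b64 -F exe=/usr/bin/bash -F success=1 -S connect -k "remote_shell"
      ## rules needed by add_session_metadata
      -a always,exit -F arch=b64 -S execve,execveat -k exec
      -a always,exit -F arch=b64 -S exit_group
      -a always,exit -F arch=b64 -S setsid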

Also, can you tell us more about the environment you're running on? Is this a K8s cloud service? Is there a host where you can run uname -a and share the output?

We don't have the rule
-a always,exit -F arch=b64 -S setsid
so I'll add it.

These are our own on-premises Kubernetes servers.

# uname -a
Linux hostname.example.com 5.15.0-210.163.7.el8uek.x86_64

When server activity is low, the memory grows slowly; the leak accelerates as the load increases.

So, the good news is that I think I've reproduced this. It looks like different components are keying off different values in the database that the processor uses. I still haven't found a workaround.

Currently working on a fix.