Memory leak on kubernetes nodes

Hi, I'm seeing a memory issue when running Auditbeat on Kubernetes nodes.
I have tried Auditbeat 8.12 and 8.13.
Here is the config I'm running:

auditbeat.modules:

- module: auditd
  processors:
    - add_session_metadata:
        backend: "auto"
    - add_docker_metadata:
    - drop_event:
        when:
          or:
            - has_fields: ['container']
            - contains:
                process.entry_leader.entry_meta.type: "container"
            - contains:
                process.entry_leader.args: "containerd"

  audit_rules: |
    -a exit,always -F arch=b64 -F euid=0 -S execve -k rootact
    -a exit,always -F arch=b32 -F euid=0 -S execve -k rootact
    -a always,exit -F arch=b64 -S connect -F a2=16 -F success=1 -F key=network_connect_4
    -a always,exit -F arch=b64 -F exe=/bin/bash -F success=1 -S connect -k "remote_shell"
    -a always,exit -F arch=b64 -F exe=/usr/bin/bash -F success=1 -S connect -k "remote_shell"

Auditbeat's RAM usage grows over time until Linux OOM-kills it, and the process requires a restart.

The sharp decreases in memory consumption on the graph correspond to restarts of the Auditbeat service or of the server.

Thanks in advance.

Hi,

I think the large memory usage is likely being caused by the add_session_metadata processor. It tracks all process forks and executions in memory when it's used. The memory usage should be bounded, but it can grow quite large.

There was a large architectural redesign of the add_session_metadata processor in 8.16 that may help with this. Could you upgrade to Auditbeat 8.16 or higher? Or if you're not using the data provided by this processor, you could remove it from the config.
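
If you do remove it, note that the process.entry_leader.* conditions in your drop_event processor rely on fields that add_session_metadata populates, so they would no longer match anything. A minimal sketch of the processors block without it, keeping only the container-field check, might look like this:

    processors:
      - add_docker_metadata:
      - drop_event:
          when:
            has_fields: ['container']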

Yes, I also believe it is caused by the add_session_metadata processor.
We tried 8.16 and 8.17; it didn't help.

We use it to exclude events inside containers from the audit. Can you recommend another way to ensure that only events generated by the k8s node's own processes are included in the audit, excluding containers?

Can you confirm it is indeed the add_session_metadata processor?
Running one instance without the processor for a while should be enough to see if the pattern changes.

After that, can you get a memory profile? You'll have to set http.pprof.enabled (see Configure an HTTP endpoint for metrics | Auditbeat Reference [8.17] | Elastic).
Just be sure to get a dump when the memory usage is high enough.
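
For reference, a minimal sketch of what that could look like — the host/port values are placeholders, and the endpoint path assumes the standard Go pprof handlers that the Beats HTTP endpoint exposes when http.pprof.enabled is set:

    # in auditbeat.yml: enable the local stats/pprof HTTP endpoint
    http.enabled: true
    http.host: localhost
    http.port: 5066
    http.pprof.enabled: true

Then, while memory usage is high:

    # dump a heap profile and inspect it (inspection needs a Go toolchain)
    curl -s http://localhost:5066/debug/pprof/heap > heap.pprof
    go tool pprof -top heap.pprof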

Yes, I disabled the add_session_metadata processor and the memory leak stopped.

@dimaz can you try adding -a always,exit -F arch=b64 -S exit_group to your audit rules?

@Michael_Wolf It looks like if we don't get any exit_group events, the processor DB instance won't reap any dead processes?

I added -a always,exit -F arch=b64 -S exit_group and it had no effect on the memory leak.
Here is the memory profile that I got by setting http.pprof.enabled.

master node:

{
"auditd": {
"kernel_lost": 31435,
"reassembler_seq_gaps": 884763805174,
"received_msgs": 1492935,
"userspace_lost": 0
},
"beat": {
"cgroup": {
"cpu": {
"cfs": {
"period": {
"us": 100000
},
"quota": {
"us": 0
}
},
"id": "auditbeat.service",
"stats": {
"periods": 0,
"throttled": {
"ns": 0,
"periods": 0
}
}
},
"cpuacct": {
"id": "auditbeat.service",
"total": {
"ns": 341370860453
}
},
"memory": {
"id": "auditbeat.service",
"mem": {
"limit": {
"bytes": 9223372036854771712
},
"usage": {
"bytes": 1127354368
}
}
}
},
"cpu": {
"system": {
"ticks": 46590,
"time": {
"ms": 46590
}
},
"total": {
"ticks": 341360,
"time": {
"ms": 341360
},
"value": 341360
},
"user": {
"ticks": 294770,
"time": {
"ms": 294770
}
}
},
"handles": {
"limit": {
"hard": 262144,
"soft": 262144
},
"open": 26
},
"info": {
"ephemeral_id": "13f39bfc-d91f-43f0-9a95-21636139775d",
"name": "auditbeat",
"uptime": {
"ms": 190182143
},
"version": "8.17.0"
},
"memstats": {
"gc_next": 40572544,
"memory_alloc": 26293464,
"memory_sys": 79088195,
"memory_total": 56586289080,
"rss": 1112170496
},
"runtime": {
"goroutines": 27
}
},
"libbeat": {
"config": {
"module": {
"running": 0,
"starts": 0,
"stops": 0
},
"reloads": 0,
"scans": 0
},
"output": {
"batches": {
"split": 0
},
"events": {
"acked": 316014,
"active": 0,
"batches": 30604,
"dead_letter": 0,
"dropped": 0,
"duplicates": 0,
"failed": 23,
"toomany": 0,
"total": 316037
},
"read": {
"bytes": 1648146,
"errors": 1
},
"type": "logstash",
"write": {
"bytes": 126110743,
"errors": 1,
"latency": {
"histogram": {
"count": 0,
"max": 0,
"mean": 0,
"median": 0,
"min": 0,
"p75": 0,
"p95": 0,
"p99": 0,
"p999": 0,
"stddev": 0
}
}
}
},
"pipeline": {
"clients": 1,
"events": {
"active": 2,
"dropped": 0,
"failed": 0,
"filtered": 0,
"published": 316015,
"retry": 53,
"total": 316016
},
"queue": {
"acked": 316014,
"added": {
"bytes": 0,
"events": 316015
},
"consumed": {
"bytes": 0,
"events": 316014
},
"filled": {
"bytes": 0,
"events": 1,
"pct": 0.00048828125
},
"max_bytes": 0,
"max_events": 2048,
"removed": {
"bytes": 0,
"events": 316014
}
}
}
},
"metricbeat": {
"auditd": {
"auditd": {
"consecutive_failures": 0,
"events": 316017,
"failures": 116,
"success": 315903
}
}
},
"system": {
"cpu": {
"cores": 2
},
"load": {
"1": 0.34,
"15": 0.28,
"5": 0.27,
"norm": {
"1": 0.17,
"15": 0.14,
"5": 0.135
}
}
}
}

worker node:

{
"auditd": {
"kernel_lost": 22594,
"reassembler_seq_gaps": 5793911785175,
"received_msgs": 5855068,
"userspace_lost": 0
},
"beat": {
"cgroup": {
"cpu": {
"cfs": {
"period": {
"us": 100000
},
"quota": {
"us": 0
}
},
"id": "auditbeat.service",
"stats": {
"periods": 0,
"throttled": {
"ns": 0,
"periods": 0
}
}
},
"cpuacct": {
"id": "auditbeat.service",
"total": {
"ns": 1031281329977
}
},
"memory": {
"id": "auditbeat.service",
"mem": {
"limit": {
"bytes": 9223372036854771712
},
"usage": {
"bytes": 5337092096
}
}
}
},
"cpu": {
"system": {
"ticks": 139370,
"time": {
"ms": 139370
}
},
"total": {
"ticks": 1031270,
"time": {
"ms": 1031270
},
"value": 1031270
},
"user": {
"ticks": 891900,
"time": {
"ms": 891900
}
}
},
"handles": {
"limit": {
"hard": 262144,
"soft": 262144
},
"open": 58
},
"info": {
"ephemeral_id": "057fbe51-89e4-4f37-8bff-6663e94dd299",
"name": "auditbeat",
"uptime": {
"ms": 318090029
},
"version": "8.17.0"
},
"memstats": {
"gc_next": 48628040,
"memory_alloc": 32350416,
"memory_sys": 127224387,
"memory_total": 136483867096,
"rss": 5384351744
},
"runtime": {
"goroutines": 29
}
},
"libbeat": {
"config": {
"module": {
"running": 0,
"starts": 0,
"stops": 0
},
"reloads": 0,
"scans": 0
},
"output": {
"batches": {
"split": 0
},
"events": {
"acked": 852111,
"active": 0,
"batches": 54461,
"dead_letter": 0,
"dropped": 0,
"duplicates": 0,
"failed": 97,
"toomany": 0,
"total": 852208
},
"read": {
"bytes": 2701408,
"errors": 2
},
"type": "logstash",
"write": {
"bytes": 382956376,
"errors": 3,
"latency": {
"histogram": {
"count": 0,
"max": 0,
"mean": 0,
"median": 0,
"min": 0,
"p75": 0,
"p95": 0,
"p99": 0,
"p999": 0,
"stddev": 0
}
}
}
},
"pipeline": {
"clients": 1,
"events": {
"active": 5,
"dropped": 0,
"failed": 0,
"filtered": 24376,
"published": 852115,
"retry": 258,
"total": 876492
},
"queue": {
"acked": 852111,
"added": {
"bytes": 0,
"events": 852115
},
"consumed": {
"bytes": 0,
"events": 852111
},
"filled": {
"bytes": 0,
"events": 4,
"pct": 0.001953125
},
"max_bytes": 0,
"max_events": 2048,
"removed": {
"bytes": 0,
"events": 852111
}
}
}
},
"metricbeat": {
"auditd": {
"auditd": {
"consecutive_failures": 0,
"events": 876493,
"failures": 959,
"success": 875538
}
}
},
"system": {
"cpu": {
"cores": 10
},
"load": {
"1": 0.1,
"15": 0.12,
"5": 0.1,
"norm": {
"1": 0.01,
"15": 0.012,
"5": 0.01
}
}
}
}

Alright, I can't reproduce the full leak, but I suspect there's some kind of system-dependent behavior going on here.

@dimaz

The docs mention setting a number of audit rules specifically for this processor; can you make sure they're all set?

    ## executions
    -a always,exit -F arch=b64 -S execve,execveat -k exec
    -a always,exit -F arch=b64 -S exit_group
    ## set_sid
    -a always,exit -F arch=b64 -S setsid
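
As a quick sanity check (assuming the audit userspace tools are installed on the host), you can list the rules currently loaded in the kernel and grep for the ones above:

    auditctl -l | grep -E 'execve|exit_group|setsid'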

Also, can you tell us more about the environment you're running on? Is this a K8s cloud service? Is there a host you can run uname -a on and give us the output?

We don't have the rule
-a always,exit -F arch=b64 -S setsid
so I'll add it.

These are our own on-premises Kubernetes servers.

# uname -a
Linux hostname.example.com 5.15.0-210.163.7.el8uek.x86_64

When server activity is low, the memory leak grows slowly; it grows faster as the load increases.

So, the good news is that I think I've reproduced this. I think that different components are just keying off of different values in the database that the processor uses. Still haven't found a workaround.

Currently working on a fix.

Hi all! I have a related question.

While we are waiting for a fix, we have limited Auditbeat's memory usage with cgroups. We have noticed that when Auditbeat gets close to its memory limit, disk I/O usage starts to increase significantly, even though swap is off and no tools like systemd-swap are in use. Using 'top', we can see kswapd working heavily. As a result, the operation of our servers can be interrupted.
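
(For context, one way to apply such a cgroup limit is a systemd drop-in for the auditbeat service; the 512M value below is only an illustration, and MemoryMax= assumes a reasonably recent systemd — on pure cgroup v1 setups the legacy MemoryLimit= directive is the equivalent.)

    # /etc/systemd/system/auditbeat.service.d/memory.conf
    [Service]
    MemoryMax=512M
    # apply with: systemctl daemon-reload && systemctl restart auditbeat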

Could you please kindly clarify why this happens and if this situation can be avoided?

For reference, the fix for the memory leak issue should be available in 8.19.

Have you been able to verify that Auditbeat is the process using swap memory? You can check by running grep VmSwap /proc/[pid of auditbeat]/status.
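
For example, assuming a single auditbeat process that can be found by name with pgrep:

    grep VmSwap /proc/$(pgrep -x auditbeat)/status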


Thanks. Currently VmSwap is 0 everywhere; we will check again when the memory consumption starts growing. We will also try v8.19.

Could you please provide an update on when I can expect the memory leak problem to be fixed?

Thank you for your assistance.

@Alex_Kristiansen fixed it in February here: Handle leak of process info in `hostfs` provider for `add_session_met… · elastic/beats@d6ff82b · GitHub

Hi everybody,

We built it from commit d6ff82b. It's version 9.1.0, but it does not support add_session_metadata.

This is the full error message:
auditbeat[3320949]: Exiting: the processor action add_session_metadata does not exist. Valid actions: add_fields, add_network_direction, append, community_id, decode_xml, dissect, truncate_fields, add_kubernetes_metadata, add_labels, decompress_gzip_field, add_observer_metadata, decode_duration, decode_base64_field, decode_json_fields, drop_event, include_fields, add_host_metadata, dns, extract_array, uppercase, add_locale, add_process_metadata, decode_xml_wineventlog, registered_domain, syslog, convert, add_formatted_index, add_tags, lowercase, rename, replace, add_docker_metadata, add_id, translate_ldap_attribute, urldecode, copy_fields, drop_fields, add_cloud_metadata, rate_limit, detect_mime_type, fingerprint, move_fields, script

Is there any version of Auditbeat which supports add_session_metadata and also has the memory leak fixed?