Architecture check: Offloading Logstash gsub regex to K8s pod sidecars to save CPU?

Hi everyone,

We’ve been dealing with a classic pipeline bottleneck: our Logstash nodes are burning massive amounts of CPU because we run dozens of mutate filters with gsub rules to scrub PII and secrets (emails, Stripe tokens, API keys) before indexing into Elasticsearch.

Scaling Logstash just to handle regex processing is getting too expensive, and maintaining the regex patterns for new token formats is a nightmare.

I decided to try shifting this left and offloading it to the edge. I wrote a lightweight Go sidecar for our K8s pods that intercepts the stdout log stream. Instead of relying purely on regex, it calculates Shannon entropy on the fly to detect random-looking API keys and replaces them with deterministic HMAC hashes (e.g., [HIDDEN:e9f1a2]). It does this before Filebeat/Fluent Bit even picks the logs up.
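To make the idea concrete, here is a rough sketch of the entropy-plus-HMAC approach. It is a simplified illustration, not the actual pii-shield code: the function names (`shannonEntropy`, `maskToken`, `scrubLine`), the whitespace tokenization, the demo key, and the thresholds (minimum length 20, 4.0 bits per byte) are all assumptions picked for the example.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"math"
	"strings"
)

// shannonEntropy returns the Shannon entropy of s in bits per byte.
func shannonEntropy(s string) float64 {
	if len(s) == 0 {
		return 0
	}
	var counts [256]int
	for i := 0; i < len(s); i++ {
		counts[s[i]]++
	}
	n := float64(len(s))
	h := 0.0
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// maskToken returns a deterministic placeholder for token: the first six
// hex characters of HMAC-SHA256(key, token). The same key and token always
// yield the same placeholder. Six characters is an illustrative choice that
// mirrors the [HIDDEN:e9f1a2] format; short prefixes can collide.
func maskToken(key []byte, token string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(token))
	return "[HIDDEN:" + hex.EncodeToString(mac.Sum(nil))[:6] + "]"
}

// scrubLine splits line on whitespace and masks every field that is at
// least minLen bytes long and has entropy of at least minEntropy bits per
// byte. Note that strings.Fields collapses runs of whitespace, so the
// original spacing is not preserved in this sketch.
func scrubLine(key []byte, line string, minLen int, minEntropy float64) string {
	fields := strings.Fields(line)
	for i, f := range fields {
		if len(f) >= minLen && shannonEntropy(f) >= minEntropy {
			fields[i] = maskToken(key, f)
		}
	}
	return strings.Join(fields, " ")
}

func main() {
	key := []byte("demo-key-change-me") // hypothetical key; load from a secret in practice
	line := "charge failed key 4f9aQ2xL7pZr8TnV0bKd3WmE"
	fmt.Println(scrubLine(key, line, 20, 4.0))
}
```

Because the placeholder depends only on the key and the token, the same secret maps to the same [HIDDEN:…] value in every pod that shares the key, so occurrences can still be correlated downstream. Ordinary words like "failed" stay untouched here because they are shorter than the length threshold.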

The Logstash pipeline is now almost empty, and the node CPU dropped drastically. I open-sourced the tool here if anyone wants to see the implementation: https://github.com/aragossa/pii-shield

Has anyone else moved away from central Logstash sanitization to an edge-sanitization architecture? Are there any hidden pitfalls with Elasticsearch indexing when doing deterministic hashing at the pod level?

Would appreciate any architecture feedback or roasting of the code!
