Filebeat module design

In the past, most Filebeat modules used Elasticsearch ingest pipelines to enrich documents. This keeps the Beats as lightweight as possible, since the work is done centrally, and it scales well, since more ingest nodes can be deployed when needed. Logs are shipped in raw format, keeping the network load as small as possible. The introduction of Elastic Agent and Ingest Management integration packages makes it even simpler, since most of the module configuration, such as pipelines, does not need to be shipped to the Beats. Developing or fixing modules is easy, since most of the work can be done as pipeline development in Kibana Dev Tools.

Recently a lot of new datasets and modules have been added to Filebeat, which is great!
But it seems that most of the enrichment is now done with JavaScript processors on Filebeat itself. This means higher performance requirements on the Beats, more network load and more complex pipeline development.

Is there a roadmap for how the modules will evolve in the future, i.e. will enrichment be done centrally or distributed?

Regards
Bernhard

This is a tricky question and one we are also discussing internally. Unfortunately, as usual, it depends. In general I think it can be said that we are pushing more processing to the center, as that makes it simple to manage with Ingest Manager and gives us a central point to update. It also removes moving parts.

But there are cases where enrichment must happen on the edge, either because the info only exists on the edge or because the info is required to filter data out.

What it comes down to is that it is often a case-by-case decision. But we haven't been too consistent in the past and should get better at it.

My personal take, in case you are thinking of building your own packages: do the processing centrally until you hit an issue, and only then process on the edge.

@rufflin Thanks for the clarification. It makes sense. The point is that recently it seems to be going the other way around. E.g. the Sophos module was created some time ago with ingest pipelines and has now been moved to JavaScript processors. As far as I can see, most of the new modules that are based on RSA sources use edge processing as well. IMHO, edge processing is only needed in rare cases, and could be reduced even further by implementing additional ingest processors such as dns etc. As you mentioned as well, central processing greatly simplifies things, scales well and is easy to maintain. I hope there will be a path to move modules and integrations to central processing as much as possible.

I have another perspective on this, and would also strongly advocate for central processing in ingest pipelines wherever possible.

Consider a scenario where I want to parse nginx logs, but they're collected through journald. The Nginx module essentially relies on consuming nginx access logs in plain old logfiles. Even if I were to use the advanced "input override" feature, there is no journald input currently.

If nginx access log processing were instead encapsulated in an ingest pipeline, I could ingest the data using a journalbeat that points to a preprocessing pipeline. That pipeline strips the journald specifics and emits a compatible document into the nginx_access pipeline. That's modularity. It's also a proven workflow in the Logstash world.
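
As a rough sketch of what I mean (Python is only used here to build the request body; the pipeline name journald_nginx_preprocess, the target name nginx_access and the removed field names are assumptions for illustration, not what any shipped module actually uses), the preprocessing pipeline could end with a pipeline processor that hands the document over:

```python
import json

# Hypothetical pipeline names, chosen for illustration only.
PREPROCESS_PIPELINE = "journald_nginx_preprocess"
NGINX_PIPELINE = "nginx_access"

# Assumed journald-specific field names; adjust to what the shipper really emits.
preprocess_body = {
    "description": "Strip journald specifics, then hand over to the nginx access pipeline",
    "processors": [
        # Drop fields that the downstream pipeline does not expect.
        {"remove": {"field": ["journald", "syslog"], "ignore_missing": True}},
        # Chain into the existing pipeline instead of duplicating its logic.
        {"pipeline": {"name": NGINX_PIPELINE}},
    ],
}


def put_pipeline_request(name, body):
    """Return (method, path, json_body) for creating an ingest pipeline."""
    return ("PUT", f"/_ingest/pipeline/{name}", json.dumps(body, indent=2))


method, path, payload = put_pipeline_request(PREPROCESS_PIPELINE, preprocess_body)
print(method, path)
print(payload)
```

The shipper would then only need to point at the preprocessing pipeline (for example via the pipeline setting of the Elasticsearch output), and the nginx parsing itself stays in one central place.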

Basically, Beat modules with heavy use of .js processors are great for getting results quickly in default scenarios, but in my experience they're not flexible enough at all. Tweaking the .js processors or cloning and adapting the module is really not an option, because you lose all your tweaks on the next release (or have to painstakingly merge your customizations with the new module version).

Another path forward would be to make Beat modules way more customizable. Maybe add the ability to preprocess data before sending it to the module processors. And more inputs certainly help, but I'm sure the team is always on that.
