Multiple prospectors vs. More work in pipeline

Hi,

I have a design question. Let's say I have 1 Filebeat process monitoring X log files. The log data is sent to ES through a pipeline. The pipeline is required anyway for date/timestamp processing.
The data when stored in ES has to be enriched with some extra meta-data in the form of extra document fields and tags.

I see two approaches to accomplish this enrichment:
(1) Filebeat with 1 prospector per file, making it possible to add the extra fields and tags directly in the prospector configuration (hardcoding), which leaves much less work for the pipeline.

(2) Filebeat with 1 prospector for all files, with the pipeline doing more work (i.e. grok pattern matching) to construct the extra fields and tags on the fly.

Which of the two designs would be the better choice?

Thx,

Hello @Dominik,

I think the answer depends on your traffic and on what grok you want to do, but there is always a trade-off between performance and flexibility.

Let's say that you have a thousand Beats connected to your cluster, generating a lot of traffic, and you have to apply a grok expression to every event. Grok expressions are basically sugar on top of regular expressions; depending on what you need to parse, they can be slow and put more load on your cluster. Depending on capacity, this might slow down ingestion. You might want to test your maximum ingestion rate with that pipeline.

Usually, FB on the edge has low memory and CPU usage. I presume you want to add a prospector per file type (syslog, nginx). Depending on how many file types we are talking about, it might just be better to hardcode these values on the prospector. Because these values are static and should never change, it is easy to add the data to the event with less processing.
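
To make the hardcoding approach concrete, here is a rough sketch of what one prospector per file type could look like. It assumes Filebeat 6.x syntax; the paths, the `log_type` field, the tag values, the host and the pipeline name `timestamp_pipeline` are all made up for illustration, so adjust them to your setup.

```yaml
# Sketch only: paths, field names, tags, host and pipeline name are hypothetical.
filebeat.prospectors:
  # One prospector per file type, each carrying its own static metadata.
  - type: log            # older 5.x releases use `input_type: log` instead
    paths:
      - /var/log/syslog
    fields:
      log_type: syslog   # stored as fields.log_type in the event
    tags: ["system"]

  - type: log
    paths:
      - /var/log/nginx/access.log
    fields:
      log_type: nginx
    tags: ["web"]

output.elasticsearch:
  hosts: ["localhost:9200"]
  # The ingest pipeline is still used, but only for date/timestamp processing.
  pipeline: timestamp_pipeline
```

By default the custom fields end up under a `fields` key in the event; setting `fields_under_root: true` on a prospector puts them at the top level instead. With this layout the pipeline does not need any grok work to derive the metadata.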

If you look at our module implementation, depending on the module we create more than one prospector :slight_smile:

Thanks

Thanks for that insight, it confirms a bit what I was thinking. My feeling was that I would prefer to limit load on the ES processes and keep the data (static) in FB.
