A potential bug with Filebeat script processor

Hi team, thanks for your great work! I'm using filebeat 8.10.3, and seeing a potential bug with it.

I'm using the script processor to do some caching & filtering of log messages. To be specific, I'm using a LRU cache to filter messages seen recently, and passing messages downstream only when it's not hit in LRU cache. The lated part is shown here:

filebeat.inputs:
- type: filestream
  id: xxxx
  enabled: true
  paths:
    - /xxxx/rotate.log

processors:
  - script:
      lang: javascript
      source: >
      	// setup the LRU cache
        var lru = new LRU(50000);

        function process(event) {
            .....
            if (lru.write(msg)) {
            	// the message has already been in LRU cache, do not proceed
                event.Cancel(); 
                return;
            }

        }        

As you can see, with one single file rotate.log, it works as expected. Duplicate messages are filtered by LRU cache.

Since I need filebeat to work with rotating log files, I then change the log file from "/xxxx/rotate.log" to "/xxxx/rotate*.log". I can see in filebeat log that it's working with multiple log files at the same time, but it doesn't seem to respect LRU cache any more. With downstream receiver I can see duplicate messages close together.

Please let me know if it is a bug or mis-use. Thanks!

So I think the issue here comes from how the script processor works.

The script processor initiates a small number of JavaScript virtual machines which It uses to process events. I think the default number is four virtual machines. My guess is that under a small load you will likely be reusing the same virtual machine for every event.

Under higher load a single virtual machine can't keep up, and it begins using additional virtual machines. Once it has exhausted the pre-initialized virtual machines, I believe it will initialize additional virtual machines to handle more events.

You could verify this behavior by generating a uuid in your script alongside the lru cache and attaching that uuid to the event. This would help you better understand how many virtual machines are in use.

In summary, I think the issue is that you are expecting file beat to push every event through a single instance of the script, but filebeat generates multiple instances of the script to process events which will complicate your desired utilization of the script processor.

Here are the documented recommendations for deduplicating data using Filebeat and a couple potential downstream receivers. Deduplicate data | Filebeat Reference [8.12] | Elastic

Thanks @strawgate, that clarifies a lot. I do have busy logs, and a workaround would be to use one filebeat process for each log file.

Speaking of VMs, do you have any suggestion about increasing script processor's runtime performance? My operations in javascript are all memory bound (LRU, trie search, etc), and the performance still not reaching my expectation.

I'm not sure if Filebeat guarantees only using a single script instance if you keep it to one file.

Are you sure the performance limitation is from the script processor and not from the output or input?

I don't have a lot of experience with the script processor so I don't have any good suggestions there.

@strawgate I've tested with two filebeat instances (max_procs=4) working on two log files parallelly. From downstream recievers I can see LRU does work as expected in each instance. If possible please guide me to the related source code where I can refer to for confirmation.

As for the performance limitation I'm sure it's caused by script processor. The input of filebeat is local file, and output is kafka (bulk writing configured). Per testing results kafka's throughput can be much higher than current filebeat's output.

There is a config setting name max_cached_sessions that you could set to 1. Then all events get processed by the same JS VM. This also can makes it more of a bottleneck if you are doing parallel processing.

The in-memory cache processor could be a better tool for this job. It’s probably better suited to handling large caches and access across threads. Add cached metadata | Filebeat Reference [8.12] | Elastic

My guess is you'd cache some sort of unique key like what your LRU is doing and then use the cached metadata processor to populate a target field if that id was found in the cache and then you'd drop the event if there is data in that target field.

Thank you @strawgate . Will try your suggestions.