Filebeat include_lines performance vs. grep

I use Filebeat to harvest lines containing a keyword and send them to Logstash for post-processing.
But the time Filebeat spends searching for the string is much longer than running grep in an Ubuntu shell.
I don't have numbers to show, but I can definitely 'feel' the difference.
Did the Beats team ever compare the performance of Filebeat's include_lines against grep?

Here is my environment -

  • filebeat 6.4.1 in an Ubuntu docker container
  • ELK and the Ubuntu Filebeat container are on the same network (created through docker-compose), running on the same PC
  • each message file is around 1.1 MB, with around 12,000 lines

filebeat configuration -

    filebeat.inputs:
    - type: log
      enabled: true
      paths:
        - /mymessagefiles///messages_*
      include_lines: ['waiting for mykeyword.*']
      close_eof: true
      harvester_limit: 4096
      scan_frequency: 600

    output.logstash:
      hosts: ["logstash:5044"]
      index: 'filebeat_sit73'

    path.data: /filebeat/data

include_lines patterns are always treated as full regular expressions, while grep given a plain-string pattern (no regex metacharacters, or run with -F) can effectively do a fast fixed-string search.

Go's regexp engine is known not to be the fastest. On top of that, using an unanchored regular expression just to find a sub-string effectively degenerates into the naive string-matching approach, which is roughly quadratic (pattern length times input length) in the worst case.
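For a rough feel of the difference (a sketch, not a rigorous benchmark), the snippet below matches the same plain sub-string once through Go's regexp package and once through strings.Contains. The synthetic log line is made up to resemble the question; on a typical machine the strings.Contains loop usually finishes noticeably faster.

    // Rough illustration only: time an unanchored regexp match against a
    // plain sub-string search for the pattern from the question.
    package main

    import (
        "fmt"
        "regexp"
        "strings"
        "time"
    )

    func main() {
        line := strings.Repeat("some unrelated log text ", 40) + "waiting for mykeyword: done"
        re := regexp.MustCompile(`waiting for mykeyword.*`)

        const iterations = 100000

        start := time.Now()
        for i := 0; i < iterations; i++ {
            re.MatchString(line)
        }
        fmt.Println("regexp:          ", time.Since(start))

        start = time.Now()
        for i := 0; i < iterations; i++ {
            strings.Contains(line, "waiting for mykeyword")
        }
        fmt.Println("strings.Contains:", time.Since(start))
    }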

Assuming you use GNU grep, you will find it uses quite a few tricks and a much more efficient string-matching algorithm (Boyer-Moore style, which can skip over most of the input for literal patterns). See here: https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
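For illustration only, here is a minimal Boyer-Moore-Horspool sketch in Go, a simplified relative of the algorithm the linked mail describes; this is a teaching sketch, not grep's actual code. The point is the skip table: on a mismatch the search can jump ahead by up to the full pattern length instead of advancing one byte at a time.

    package main

    import "fmt"

    // horspool returns the index of the first occurrence of pattern in text,
    // or -1 if it is absent.
    func horspool(text, pattern string) int {
        m, n := len(pattern), len(text)
        if m == 0 {
            return 0
        }
        // For each byte, how far the pattern may be shifted when that byte is
        // aligned with the last pattern position and the comparison fails.
        var skip [256]int
        for i := range skip {
            skip[i] = m
        }
        for i := 0; i < m-1; i++ {
            skip[pattern[i]] = m - 1 - i
        }
        for i := 0; i+m <= n; {
            j := m - 1
            for j >= 0 && text[i+j] == pattern[j] {
                j--
            }
            if j < 0 {
                return i // full match starting at i
            }
            i += skip[text[i+m-1]] // skip ahead, often by the full pattern length
        }
        return -1
    }

    func main() {
        fmt.Println(horspool("... waiting for mykeyword: done", "waiting for mykeyword"))
    }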

Beats also needs to create events, and the input is copied into an internal buffer to deal with potential file truncation. Lines are turned into events first and only the finished event is matched, so that filtering also works correctly with features like multiline, where several raw lines are merged into a single event.
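As a very rough sketch of that extra per-line work (the field names and the file path are made up for the example and are not Filebeat's actual event layout):

    // Sketch: read lines, wrap each one in an event structure with some
    // metadata, and only then apply the include_lines matcher.
    package main

    import (
        "bufio"
        "os"
        "regexp"
        "time"
    )

    func main() {
        matcher := regexp.MustCompile(`waiting for mykeyword.*`)

        f, err := os.Open("/var/log/messages") // illustrative path
        if err != nil {
            panic(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        var events []map[string]interface{}
        for scanner.Scan() {
            line := scanner.Text()
            // Build the event first (copying the line, adding metadata) ...
            event := map[string]interface{}{
                "@timestamp": time.Now(),
                "message":    line,
            }
            // ... and filter on the finished event, so filtering still works
            // after features like multiline have merged lines.
            if matcher.MatchString(event["message"].(string)) {
                events = append(events, event)
            }
        }
        _ = events // a real shipper would hand these to the publishing pipeline
    }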

In Beats we do some analysis and optimisation of matchers where applicable. In this case an inefficient regex that is really just a plain string should be turned into a more efficient sub-string match, using different algorithms depending on the size of the pattern.
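A minimal sketch of that idea, assuming a hypothetical newMatcher helper (this is not libbeat's actual match package): compile the pattern, and if it turns out to be a pure literal, fall back to a plain sub-string search.

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    // matcher picks a plain sub-string search when the pattern is a literal
    // string, and falls back to the compiled regexp otherwise.
    type matcher struct {
        lit string
        re  *regexp.Regexp
    }

    func newMatcher(pattern string) (*matcher, error) {
        re, err := regexp.Compile(pattern)
        if err != nil {
            return nil, err
        }
        if lit, complete := re.LiteralPrefix(); complete {
            // The whole pattern is a literal, so an unanchored match is just
            // a sub-string search.
            return &matcher{lit: lit}, nil
        }
        return &matcher{re: re}, nil
    }

    func (m *matcher) MatchString(s string) bool {
        if m.re == nil {
            return strings.Contains(s, m.lit)
        }
        return m.re.MatchString(s)
    }

    func main() {
        m, _ := newMatcher("waiting for mykeyword")
        fmt.Println(m.MatchString("... waiting for mykeyword: done")) // true
    }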

Also keep in mind that grep only has to print the matching lines to the console. Filebeat creates events and buffers them into batches; only when the buffer is full, or after some timeout, is the batch published and the events forwarded to the outputs. It then does some additional processing, encodes the events with their metadata to JSON, and forwards them to Logstash. Logstash and Elasticsearch can add further back-pressure, slowing down reading in Filebeat. Once Filebeat's queues are full due to back-pressure, it stops processing any more input until Logstash/Elasticsearch have caught up.
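A toy sketch of that size/timeout batching (batch size, timeout, and the publish function are invented for the example; Filebeat's real publishing pipeline is more involved):

    package main

    import (
        "fmt"
        "time"
    )

    func publish(batch []string) {
        // In Filebeat this would encode the events to JSON and send them to
        // Logstash; here we just print the batch size.
        fmt.Println("publishing batch of", len(batch), "events")
    }

    func main() {
        events := make(chan string)
        go func() {
            for i := 0; i < 10; i++ {
                events <- fmt.Sprintf("event %d", i)
                time.Sleep(100 * time.Millisecond)
            }
            close(events)
        }()

        const maxBatch = 4
        flushTimeout := 300 * time.Millisecond

        batch := make([]string, 0, maxBatch)
        timer := time.NewTimer(flushTimeout)
        for {
            select {
            case ev, ok := <-events:
                if !ok {
                    if len(batch) > 0 {
                        publish(batch) // flush whatever is left on shutdown
                    }
                    return
                }
                batch = append(batch, ev)
                if len(batch) == maxBatch {
                    publish(batch) // flush because the batch is full
                    batch = batch[:0]
                    if !timer.Stop() {
                        <-timer.C
                    }
                    timer.Reset(flushTimeout)
                }
            case <-timer.C:
                if len(batch) > 0 {
                    publish(batch) // flush because the timeout expired
                    batch = batch[:0]
                }
                timer.Reset(flushTimeout)
            }
        }
    }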
