Filebeat include_lines performance v.s. grep

dontscrambleme · October 10, 2018, 5:59am

I use Filebeat to harvest lines that contain certain keywords and send them to Logstash for post-processing.
However, Filebeat takes much longer to search for the string than grep does in an Ubuntu shell.
I don't have numbers to show, but I can definitely 'feel' it.
Did the Beats team compare Filebeat's include_lines performance against grep?

Here is my environment -

Filebeat 6.4.1 in an Ubuntu Docker container
ELK and the Ubuntu Filebeat container are on the same network (created through docker-compose), all running on one PC
each message file is around 1.1 MB and contains around 12000 lines

Filebeat configuration -

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /mymessagefiles///messages_*
  include_lines: ['waiting for mykeyword.*']
  close_eof: true
  harvester_limit: 4096
  scan_frequency: 600

output.logstash:
  hosts: ["logstash:5044"]
  index: 'filebeat_sit73'

path.data: /filebeat/data

steffens · October 12, 2018, 5:36pm

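include_lines is always evaluated as a full regular expression. grep also takes a (basic) regular expression by default, but GNU grep searches for the literal part of the pattern first and only runs its regex matcher when it has to.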

In Go, the regex engine is not the most efficient; this is a known issue. On top of that, using a regular expression to match a sub-string (plus back-tracking by the regex engine) is effectively similar to the naive string matching algorithm, which has roughly quadratic worst-case time complexity (O(n·m) for a text of length n and a pattern of length m).
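To make the cost of naive matching concrete, here is a minimal Python sketch (not Filebeat or Go code; the function name `naive_find` and the test strings are made up for illustration). It counts character comparisons so the n·m behaviour in the worst case is visible.

```python
from typing import Tuple


def naive_find(text: str, pattern: str) -> Tuple[int, int]:
    """Return (index of first match or -1, number of character comparisons)."""
    n, m = len(text), len(pattern)
    comparisons = 0
    # Try every possible alignment of the pattern against the text.
    for start in range(n - m + 1):
        matched = 0
        while matched < m:
            comparisons += 1
            if text[start + matched] != pattern[matched]:
                break
            matched += 1
        if matched == m:
            return start, comparisons
    return -1, comparisons


# Worst case: every alignment matches right up to the last pattern character.
text = "a" * 1000
pattern = "a" * 49 + "b"
index, comparisons = naive_find(text, pattern)
print(index, comparisons)
```

On typical log lines a mismatch shows up after the first character or two, so the naive scan is usually much cheaper than this worst case suggests.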

Assuming you use GNU grep, you will find that it uses quite a few tricks and a much more efficient string matching algorithm. See here: https://lists.freebsd.org/pipermail/freebsd-current/2010-August/019310.html
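The linked post credits much of GNU grep's speed to Boyer-Moore searching, which skips over input instead of inspecting every byte. The sketch below is a simplified Horspool variant in Python, not grep's actual implementation; `horspool_find` and the sample line are hypothetical. It reports how many alignments were tried so the skipping is visible.

```python
from typing import Tuple


def horspool_find(text: str, pattern: str) -> Tuple[int, int]:
    """Return (index of first match or -1, number of alignments tried)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0, 0
    # For each character (except the last one of the pattern), how far the
    # pattern may slide when that character sits under the window's last slot.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    alignments = 0
    start = 0
    while start <= n - m:
        alignments += 1
        if text[start:start + m] == pattern:
            return start, alignments
        # Characters that do not occur in the pattern allow a full-length jump.
        start += shift.get(text[start + m - 1], m)
    return -1, alignments


line = "x" * 100 + "waiting for mykeyword"
index, alignments = horspool_find(line, "waiting for mykeyword")
print(index, alignments)
```

The longer the pattern, the larger the possible jumps, which is why a long literal can be faster to find than a short one.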

Beats also need to create events, plus the input is copied into an internal buffer for dealing with potential file truncation. Lines are turned into events and only the final event will be matched. Such that it works correctly with

In Beats we do some analysis and optimisation of matchers where applicable. In this case, the inefficient regex for a plain string should be turned into a more efficient sub-string match, using different algorithms depending on the size of the input pattern.
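The idea of turning a "plain string" regex into a sub-string match can be sketched roughly as follows. This is an illustrative Python stand-in under simplifying assumptions, not the Beats matcher code: `compile_matcher` and `REGEX_META` are hypothetical names, and the only rewrite it knows is dropping a trailing `.*`, which never changes whether an unanchored search matches.

```python
import re
from typing import Callable

# Characters that carry special meaning in a regular expression.
REGEX_META = set(r".^$*+?{}[]\|()")


def compile_matcher(pattern: str) -> Callable[[str], bool]:
    """Return a line predicate, using a sub-string test when the pattern allows it."""
    core = pattern
    # Under unanchored search, "X.*" matches exactly the lines that contain X.
    while core.endswith(".*"):
        core = core[:-2]
    if core and not (set(core) & REGEX_META):
        # Pure literal: a sub-string check gives the same answer as the regex.
        return lambda line: core in line
    # Anything else keeps the full regular expression.
    compiled = re.compile(pattern)
    return lambda line: compiled.search(line) is not None


# Hypothetical log lines, using the pattern from the configuration in this thread.
matcher = compile_matcher("waiting for mykeyword.*")
print(matcher("Oct 10 05:59:01 host app: waiting for mykeyword to arrive"))
print(matcher("Oct 10 05:59:02 host app: something else"))
```

Patterns that still contain metacharacters after stripping, such as `error \d+`, fall through to the regex path unchanged.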

Also keep in mind that grep only has to print matching lines to the console. Filebeat creates events and buffers them into batches. Only when the buffer is full, or after some timeout, is the buffer published and the events forwarded to the outputs. Filebeat then does some additional processing, encodes the events with metadata as JSON, and forwards them to Logstash. Logstash and Elasticsearch can add additional back-pressure, slowing down reading in Filebeat. Once the queues in Filebeat are full due to back-pressure, it will stop processing any more input until Logstash/Elasticsearch have caught up.
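A rough Python sketch of that "flush when full or after a timeout" behaviour, plus a bounded queue standing in for back-pressure from the output, might look like this. It is only a model of the description above, not Filebeat's pipeline: `Batcher`, `try_publish`, and all sizes and timeouts are invented for illustration, and time is passed in explicitly to keep the example deterministic.

```python
import queue


class Batcher:
    """Collect events and release them as a batch when full or when too old."""

    def __init__(self, max_events: int, flush_timeout: float):
        self.max_events = max_events
        self.flush_timeout = flush_timeout
        self.events = []
        self.first_event_at = None

    def add(self, event, now: float):
        """Buffer one event; return a batch if the buffer just became full."""
        if not self.events:
            self.first_event_at = now
        self.events.append(event)
        if len(self.events) >= self.max_events:
            return self._flush()
        return None

    def poll(self, now: float):
        """Return a batch if the oldest buffered event has waited long enough."""
        if self.events and now - self.first_event_at >= self.flush_timeout:
            return self._flush()
        return None

    def _flush(self):
        batch, self.events = self.events, []
        self.first_event_at = None
        return batch


def try_publish(out_queue: "queue.Queue", batch) -> bool:
    """Hand a batch to the output; False means the output is applying back-pressure."""
    try:
        out_queue.put_nowait(batch)
        return True
    except queue.Full:
        return False


batcher = Batcher(max_events=3, flush_timeout=1.0)
output = queue.Queue(maxsize=1)  # a slow consumer that can hold one batch
for i, t in enumerate([0.0, 0.1, 0.2]):
    batch = batcher.add(f"event-{i}", now=t)
    if batch is not None:
        print("published:", try_publish(output, batch))
```

When `try_publish` returns False the reader has to wait, which is the point at which a single matching line costs far more wall-clock time than it would in grep.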

system · November 9, 2018, 5:36pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
