I use filebeat to harvest lines containing a keyword and send them to Logstash for post-processing.
But filebeat takes much longer to search for the string than running grep in an Ubuntu shell does.
I don't have numbers to show, but I can definitely 'feel' it.
Did the Beats team compare the performance of filebeat's include_lines vs. grep?
Here is my environment -
filebeat 6.4.1 in my Ubuntu Docker container
ELK and the Ubuntu filebeat container are on the same network (created through docker-compose), running on the same PC
each message file is around 1.1 MB, with around 12,000 lines in it
filebeat configuration -
```yaml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - /mymessagefiles///messages_*
  include_lines: ['waiting for mykeyword.*']
  close_eof: true
  harvester_limit: 4096
  scan_frequency: 600
```
include_lines entries are full regular expressions, while grep searching for a plain keyword effectively does a fixed-string match and does not need a regex engine at all.
In Go the regex engine is not the most efficient; this is a known issue. On top of that, using an unanchored regular expression to find a sub-string means the engine retries the match at every position in the line, which is effectively the naive string-matching algorithm with roughly quadratic time complexity.
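As a rough illustration (not a benchmark of filebeat itself), here is a minimal Go sketch comparing an unanchored regex match against a plain sub-string search; the pattern, sample line, and iteration count are only examples:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"time"
)

func main() {
	// Made-up log line: the keyword appears near the end, so an unanchored
	// regex has to retry the match at many offsets before it succeeds.
	line := strings.Repeat("some unrelated log text ", 20) + "waiting for mykeyword: done"
	re := regexp.MustCompile(`waiting for mykeyword.*`)

	const iterations = 100000

	// Unanchored regex scan over the whole line.
	start := time.Now()
	for i := 0; i < iterations; i++ {
		_ = re.MatchString(line)
	}
	fmt.Println("regexp.MatchString:", time.Since(start))

	// Plain sub-string search using a specialised algorithm.
	start = time.Now()
	for i := 0; i < iterations; i++ {
		_ = strings.Contains(line, "waiting for mykeyword")
	}
	fmt.Println("strings.Contains:  ", time.Since(start))
}
```

On typical inputs the sub-string search is considerably faster, because it does not have to restart a regex match at every offset in the line.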
Beats also needs to create events, and the input is copied into an internal buffer to deal with potential file truncation. Lines are turned into events, and only the final event is matched, so that filtering works correctly with multiline events.
In Beats we do some analysis and optimisation of the matchers where applicable. In this case the regex, being effectively a plain string, should be turned into a more efficient sub-string match, using different algorithms depending on the size of the input pattern.
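As a sketch of what such a matcher optimisation can look like (this is not the actual Beats implementation), the following Go code uses the regex engine only when the pattern is more than a plain literal; the helper name matcher is made up:

```go
package main

import (
	"fmt"
	"regexp"
	"regexp/syntax"
	"strings"
)

// matcher returns a line-matching function. If the pattern boils down to a
// plain literal (optionally followed by ".*"), it uses sub-string search
// instead of the regex engine. This only mirrors the idea of the optimisation
// described above; it is not how Beats actually implements it.
func matcher(pattern string) (func(string) bool, error) {
	trimmed := strings.TrimSuffix(pattern, ".*")
	if re, err := syntax.Parse(trimmed, syntax.Perl); err == nil && re.Op == syntax.OpLiteral {
		literal := string(re.Rune)
		return func(s string) bool { return strings.Contains(s, literal) }, nil
	}
	compiled, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	return compiled.MatchString, nil
}

func main() {
	match, err := matcher(`waiting for mykeyword.*`)
	if err != nil {
		panic(err)
	}
	fmt.Println(match("2018-10-01 12:00:00 waiting for mykeyword to appear")) // true
	fmt.Println(match("2018-10-01 12:00:01 something else entirely"))         // false
}
```

A pattern like `waiting for mykeyword.*` reduces to a literal, so the cheap sub-string path is taken; anything with real regex metacharacters falls back to the regex engine.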
Also keep in mind that grep only has to print matching lines to the console. Filebeat creates events and buffers them into batches; only when the buffer is full, or after some timeout, is the batch published and the events forwarded to the outputs. It then does some additional processing, encodes the events with their metadata to JSON, and forwards them to Logstash. Logstash and Elasticsearch can add additional back-pressure, slowing down reading in filebeat. Once filebeat's queues are full due to back-pressure, it stops processing any more input until Logstash/Elasticsearch have caught up.
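To make the batching part concrete, here is a rough Go sketch of the "publish when the batch is full or a timeout expires" pattern described above; it is not the filebeat pipeline, and the batch size and timeout values are invented:

```go
package main

import (
	"fmt"
	"time"
)

// Event stands in for a filebeat event; in reality it carries metadata and is
// encoded to JSON before being sent to the output.
type Event struct{ Message string }

// batchAndPublish collects events into batches and publishes a batch when it
// is full or when the flush timeout expires.
func batchAndPublish(events <-chan Event, batchSize int, flushTimeout time.Duration, publish func([]Event)) {
	batch := make([]Event, 0, batchSize)
	timer := time.NewTimer(flushTimeout)
	defer timer.Stop()

	flush := func() {
		if len(batch) > 0 {
			publish(batch) // blocks if the output applies back-pressure
			batch = make([]Event, 0, batchSize)
		}
		// Restart the flush timer, draining a pending fire if necessary.
		if !timer.Stop() {
			select {
			case <-timer.C:
			default:
			}
		}
		timer.Reset(flushTimeout)
	}

	for {
		select {
		case ev, ok := <-events:
			if !ok {
				flush()
				return
			}
			batch = append(batch, ev)
			if len(batch) == batchSize {
				flush()
			}
		case <-timer.C:
			flush()
		}
	}
}

func main() {
	events := make(chan Event)
	go func() {
		for i := 0; i < 5; i++ {
			events <- Event{Message: fmt.Sprintf("waiting for mykeyword %d", i)}
		}
		close(events)
	}()
	batchAndPublish(events, 2, 500*time.Millisecond, func(b []Event) {
		fmt.Printf("publishing batch of %d events\n", len(b))
	})
}
```

If publish blocks because the output applies back-pressure, the loop stops draining the events channel, which has the same effect as filebeat pausing its input.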