Filebeat NOT sending all matched log entries


(Div) #1

Hello, I'm using Filebeat version 6.6.0 on both Windows and RHEL 7 (also tried on previous versions) and having some issues parsing log files greater than 12MB (our application log files are 65MB and 150MB).
We are looking for particular patterns in the log file and ship those lines to an Elasticsearch Ingest Node.
Our log contains around 200 matching entries. Filebeat is only able to ship the first 100 and pauses until another log file is generated before it ships the remaining 100. It repeats this behavior on the subsequent log files as well, e.g.:

logFile-1.log generates: (contains 200 patterns)

filebeat ships first 100 patterns.

logFile-2.log generates: (contains 200 patterns also)

filebeat ships another 100 patterns from logFile-1.log and 100 patterns from logFile-2.log

logFile-3.log generates: (contains 200 patterns also)

filebeat ships another 100 patterns from logFile-2.log and 100 patterns from logFile-3.log

Filebeat never ships all the matching entries from a single log until another log is generated.

Nothing happens until the next log file is generated (in some instances we waited for over 30 minutes).

Has anyone experienced a similar issue?
Any help is greatly appreciated.


(Steffen Siering) #2

Filebeat has a default event size limit of 10MB; see the max_bytes setting.
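For illustration, max_bytes is set per input; a minimal sketch (the path here is a placeholder):

filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log
    # raise the per-event limit from the 10MB default
    max_bytes: 104857600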

I don't really understand what you mean by 200 patterns. Is that the number of multiline events?
For how long does filebeat stop sending?


(Div) #3

Thanks so much Steffen for the response.
We did some test runs after increasing max_bytes to 104857600 (100 MB); unfortunately the results were the same :-(.
btw: by "patterns" I meant matching log lines / entries.
see below the filebeat config settings:

In this test run we waited for a single file (65 MB, 342K log entries) to be scanned completely (no log entries were added during the process).
As you can see below, the matching records were found and sent to the Elasticsearch Ingest node at roughly 25~30 minute intervals. This all stemmed from a single log file. It took 2hr 25min to scan through this 65MB log file.


(Steffen Siering) #4

Please do not share screenshots of logs and configs; they can be hard for others to read. Format logs and configs using the </> button in the editor window.

By saying patterns I assumed you were using multiline. From this incomplete screenshot I cannot tell whether you are using multiline or not. Can you please include your complete input config?

Do not use the Elasticsearch output when testing how fast filebeat can go. Use the console output instead, wait for the expected number of events, and measure the time it takes to reach that number. Normally filebeat can go through files quite fast.
Once we're sure Filebeat itself is working as required, we can look into tuning downstream.
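For example, a quick throughput test might swap the Elasticsearch output for the console output (a sketch; only one output can be enabled at a time, so comment out output.elasticsearch):

output.console:
  enabled: true

Then run filebeat and time how long it takes to emit the expected number of events, e.g. filebeat -e -c filebeat-test.yml > out.ndjson (the config file name is a placeholder).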


(David) #5

Hi Steffen,

Thank you for such a quick response. Here is our input config. I had to remove the comments because the maximum number of characters in a single reply is 7,000. I hope you do not mind.

#=========================== Filebeat inputs =============================

filebeat.inputs:

- type: log
  enabled: true

  paths:
    - C:\app1-logs\test15\*.log
  include_lines: ["RaiseAlarm","Reason: Q.850;cause=47"]
  max_bytes: 104857600

#============================= Filebeat modules ===============================

filebeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

#==================== Elasticsearch template setting ==========================

setup.template.settings:
  index.number_of_shards: 3

#================================ General =====================================

queue:
  mem:
    flush.min_events: 0
    flush.timeout: 1s
    
#============================== Kibana =====================================

setup.kibana:

#============================= Elastic Cloud ==================================

#================================ Outputs =====================================

#-------------------------- Elasticsearch output ------------------------------
output.elasticsearch:
  hosts: ["http://es-host1fqdn:9200", "http://es-host2fqdn:9200"]
  index: "app1-log-%{+yyyy.MM.dd}"
  pipeline: "app1-log"
  
  
setup.template.name: "filebeat"
setup.template.fields: "fields.yml"
setup.template.overwrite: false
setup.template.pattern: "app1-log-*"
#----------------------------- Logstash output --------------------------------

#================================ Logging =====================================

logging.level: debug

#============================== Xpack Monitoring ===============================

We have tried measuring speed with the console output and it's the same result. With debugging on, we can see the filebeat log output printing every single line of the app log we're parsing.

We are also not using multi-line.

Thanks,

David


(Steffen Siering) #6

Filebeat reads the log line by line. The include_lines setting is a list of regular expressions matched against every single line.
As you don't have multiline, I don't think you need to set max_bytes (unless you have over 10MB of content per line in your file).

Have you checked CPU usage?

What kind of machine are you using? VMs, bare metal?

Do you read from local disk or some remote storage?

The matching in Beats tries to use a more optimal string matcher if it finds a regex in include_lines to be a plain string. The second pattern, as written, is not a plain string to the regex parser. A minor optimization to include_lines:

include_lines: ["RaiseAlarm","Reason: Q\.850;cause=47"]

Given that even console output is this slow, I wonder whether filebeat is getting any resources or whether available IO bandwidth is the problem.


(David) #7

We've tried Windows Server VMs on enterprise hardware with no resource constraints (8 vCPUs etc.); all display the same results.

Most of the tests are done on my local machine (since I could replicate the issue there), reading off a workstation-grade SSD with a 4-thread 6th gen i7 and 20GB of memory. CPU usage hovers around the typical 8~13%.

As a comparison, if I use the same strings and plug them into a GUI text editor like TextPad v8, it takes less than 2 seconds.

I'll try adding the back-slash in the second pattern and report back.

UPDATE: It doesn't like the back-slash:
include_lines: ["RaiseAlarm","Reason: Q\.850;cause=47"]

Exiting: error loading config file: yaml: line 30: found unknown escape character


(Steffen Siering) #8

For regular expressions (or rather in general) use single-quoted strings. In YAML, strings within double quotes are subject to escape processing, so the backslash is treated as an escape character:

include_lines: ['RaiseAlarm','Reason: Q\.850;cause=47']

With console output and multiple CPUs I would expect filebeat to use something like 200 or 300% CPU, assuming IO bandwidth permits.
How does CPU usage change if you use console output but run with filebeat.exe ... > NUL?


(David) #9

Hi Steffen,

We did a few more tests, and found some interesting results.

As you can see in the "include_lines" section, we are matching on two patterns. To test in a more controlled manner, we mocked up a log that contained only matched patterns, a total of 350,000 lines. The log was approx. 177MB, and Filebeat parsed through the entire log without any issues in a relatively short time (1+ min).

We proceeded to inject regular unmatched log lines into this file, totaling 310,000 lines (150,000 matched lines at the top, 10,000 unmatched lines in the middle, then another 150,000 matched lines at the end). Filebeat again had no issues processing this file.

We then increased the number of unmatched lines in the middle of the file to 100,000, for a total of 400,000 lines. This is where Filebeat chokes: it only captures the first 150,000 matched lines and appears to stop after that.

In our test today we repeated a pattern of 10 matched lines followed by 100 unmatched lines until we reached a total of 400,000 lines. Filebeat processed the entire file quickly without any problems.

Based on all the behavior noted above, is there some kind of buffer we are hitting that prevents filebeat from processing the file? Keep in mind we have already tried increasing max_bytes and harvester limits without success.

Is there a way to increase the logging level beyond debug that can show us where it is getting stuck?

Thanks,


(Steffen Siering) #10

I see. This is not about reading the file, but about the ratio of lines to be sent vs. lines to be filtered out. Filebeat stores the offset of the last batch of events in the registry file. Even if most events are dropped, the registry still needs to be updated. That is, there are many registry-related IO operations without any events having been sent yet. This creates additional back-pressure from the local disk and slows down processing.

Maybe the filebeat.registry_flush setting can help reduce disk IO here.
Can you rerun your tests (with 400,000 lines) with filebeat.registry_flush: 5s to confirm my hunch?
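That is, in filebeat.yml (at the top level), something along these lines:

# delay registry writes instead of flushing after every published batch
filebeat.registry_flush: 5s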


(David) #11

Hi Steffen, that setting worked! We were able to successfully parse even a 450,000-line log.

We have a few config options we were playing around with, and wonder whether we should turn them back off (flush.min_events and flush.timeout). What do you suggest?

queue:
  # Queue type by name (default 'mem')
  # The memory queue will present all available events (up to the outputs
  # bulk_max_size) to the output, the moment the output is ready to serve
  # another batch of events.
  mem:
    # Max number of events the queue can buffer.
    #events: 4096

    # Hints the minimum number of events stored in the queue,
    # before providing a batch of events to the outputs.
    # The default value is set to 2048.
    # A value of 0 ensures events are immediately available
    # to be sent to the outputs.
    flush.min_events: 0

    # Maximum duration after which events are available to the outputs,
    # if the number of events stored in the queue is < flush.min_events.
    flush.timeout: 1s

Thanks,


(Steffen Siering) #12

For the flush settings I'd say it depends and you should do some testing regarding CPU usage.

Setting flush.min_events: 0 can reduce latency, as events are immediately available to the outputs. This can potentially increase CPU usage and force the output to publish smaller batches. It depends on how many events arrive in what period of time (burstiness of events); when testing, one wants to simulate the actual log writing.
Yet, if the queue is constantly 'full' due to backpressure, it should be no less efficient than normal buffering, as the queue still buffers events if no output is available yet.

In common cases I'd not mess with the flush settings. Normal buffering with a delay of 1s is a good compromise: not too much latency, but a chance to reduce resource usage in case of bursts.

More interesting is the queue.mem.events setting. The queue only drops events once they have been acked by the outputs, so it is a means to control the total number of events that can be active in Filebeat. That is, there is a connection between the number of hosts/workers configured in your output, bulk_max_size, and the queue size. For load balancing you want queue.mem.events = N * (worker * hosts) * bulk_max_size (with N==2 and the ES output, a next batch of events can be available to an output the moment you get the ACK). But again, you should test whether you actually need to change this setting. If performance is already good enough, then don't tune.
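As a worked example of the formula above, with hypothetical values of 2 hosts, worker: 2, and bulk_max_size: 1600:

output.elasticsearch:
  hosts: ["http://es-host1fqdn:9200", "http://es-host2fqdn:9200"]  # 2 hosts
  worker: 2
  bulk_max_size: 1600

queue.mem:
  # N * (worker * hosts) * bulk_max_size = 2 * (2 * 2) * 1600
  events: 12800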