High CPU usage on Filebeat

Hi,

We are experiencing high CPU usage from our Filebeat instances on Windows machines: the affected instances show between 15% and 50% CPU usage, while other active nodes show no measurable CPU usage even though they have similar activity (~100 events/s).

I've managed to reproduce the problem on a machine with the following configuration:

CPU: Intel(R) Xeon(R) Gold 6152 CPU @ 2.10GHz

Maximum speed: 2.10 GHz
Sockets: 8
Virtual processors: 8
Virtual machine: Yes
L1 cache: N/A

Utilization: 50%
Speed: 2.10 GHz
Up time: 31:09:27:09
Processes: 288
Threads: 7043
Handles: 213881

Memory: 20.0 GB

Slots used: N/A
Hardware reserved: 0.5 MB

Available: 5.9 GB
Cached: 4.9 GB
Committed: 17.1/26.0 GB
Paged pool: 986 MB
Non-paged pool: 472 MB
In use: 14.0 GB
filebeat.yml
#============================= Filebeat inputs ================================
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - e:\software\app\log\application_log-*.log
    tags: ['application', 'logs']
    exclude_files: ['application_log-infra-db*.log']
    json.message_key: message
    json.overwrite_keys: true
    json.keys_under_root: true
    fields:
      index_name: 'active-logs-application'
      pipeline_name: 'application_pipeline'
    ignore_older: 24h
    close_removed: true
    clean_removed: true
#    close_timeout: 5m

  - type: log
    enabled: true
    paths:
      - e:\software\app\log\trace\application_log-*.log
    tags: ['application', 'trace']
    json.message_key: message
    json.overwrite_keys: true
    json.keys_under_root: true
    fields:
      index_name: 'active-trace-application'
      pipeline_name: 'application_pipeline'
    ignore_older: 24h
    close_removed: true
    clean_removed: true
#    close_timeout: 5m

  - type: log
    encoding: 'latin1'
    enabled: true
    tags: ['gateway-logs']
    paths:
      - e:\software\gateway\log\gatewayvsc53.???.log.????.txt
    fields:
      index_name: 'active-logs-gateway'
      pipeline_name: 'gateway-trace-pipeline'
    ignore_older: 24h
    close_removed: true
    clean_removed: true
#    close_timeout: 5m

  - type: log
    encoding: 'latin1'
    enabled: true
    tags: ['auth-logs']
    include_lines: ['MMTraceId[[:blank:]]\[\w{16}\]$']
    multiline:
      negate: true
      match: 'after'
      pattern: '^$\n^\[\d{6}[[:blank:]]\d{6}\][[:blank:]]\[\d+\][[:blank:]].+$'
      flush_pattern: '^\[\d{6}[[:blank:]]\d{6}\][[:blank:]]\[\d+\][[:blank:]]ped[[:blank:]]\[\d+\].+$'
    paths:
      - e:\software\auth\log\*_operacao.log
    fields:
      index_name: 'active-logs-auth'
      pipeline_name: 'auth-input-trace'
    ignore_older: 24h
    close_removed: true
    clean_removed: true
#    close_timeout: 5m

setup.template.enabled: false
setup.ilm.enabled: false

#registry.flush: 10s
#max_procs: 1

logging.files.redirect_stderr: true
logging.to_files: true

output.elasticsearch:
  enabled: true
  hosts: ['???']
  username: '???'
  password: '???'
  index: '%{[fields.index_name]}'
  pipeline: '%{[fields.pipeline_name]}'

tags: ['windows']

About the filebeat.yml:
I tried to stay as close as possible to the production config. On the test machine the inputs look like this (a quick way to double-check these counts is sketched after the list):
The application-logs* folder contains about 2k files, ~1k of which are matched by the glob, all in JSON format.
The application-trace* folder contains 10 files, all matched by the glob, in JSON format.
The gateway-* folder contains about 3k files, ~2k matched by the glob, in plain text.
The auth-* folder contains about 40k files, ~35k matched, in multiline plain text.
In total that is roughly 38k files matched across the four inputs.
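
For reference, the counts above can be re-checked with something like the following (PowerShell, run on the test machine; the globs are the same as in the config above, and exclude_files is not applied here, so the numbers are approximate):

# count the files matched by each input's path glob
(Get-ChildItem 'e:\software\app\log\application_log-*.log').Count
(Get-ChildItem 'e:\software\app\log\trace\application_log-*.log').Count
(Get-ChildItem 'e:\software\gateway\log\gatewayvsc53.???.log.????.txt').Count
(Get-ChildItem 'e:\software\auth\log\*_operacao.log').Count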


I started the process with the Go profiler enabled, let it run for ~5 minutes, and then captured a 30-second CPU profile (http://localhost:9094/debug/pprof/profile?seconds=30).
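
For completeness, this is roughly how the profile was collected (assuming the pprof endpoint is exposed with the -httpprof flag; adjust the binary path and port to your setup):

# start Filebeat with the Go pprof endpoint exposed
filebeat.exe -c filebeat.yml -httpprof localhost:9094

# after letting it run for a few minutes, capture a 30-second CPU profile
go tool pprof http://localhost:9094/debug/pprof/profile?seconds=30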

CPU Profile

Sadly I'm not allowed to upload the profiling data to a hosting service, but I can send it by email if an address is provided.

This problem happens with Filebeat 7.0.1 and 7.2.0. We did not notice any problems when running Filebeat on the 6.x line, but it has been a long time since we upgraded to 7.x. The Elasticsearch cluster is running 7.0.1.

Is there anything else I can do to help diagnose the issue?
Should I open a bug on GitHub?
