Unusually high Metricbeat memory usage

I have Metricbeat installed on a Windows Server 2016 Datacenter server that also hosts an Elasticsearch node.

I also have Metricbeat installed on a Windows Server 2012 Standard server with an Elasticsearch node, in the same Elasticsearch cluster as the 2016 server.

Both on Elastic Stack 7.1.1.

Only the system module is enabled and configured identically on both servers:

# Module: system
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/7.1/metricbeat-module-system.html

- module: system
  period: 10s
  metricsets:
    - cpu
    #- load
    - memory
    - network
    - process
    - process_summary
    - socket_summary
    #- core
    #- diskio
    #- socket
  process.include_top_n:
    by_cpu: 20      # include top 20 processes by CPU
    by_memory: 20   # include top 20 processes by memory

- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
  - drop_event.when.regexp:
      system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'

- module: system
  period: 15m
  metricsets:
    - uptime

#- module: system
#  period: 5m
#  metricsets:
#    - raid
#  raid.mount_point: '/'
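Side note: if you want to track Metricbeat's own memory from the outside, libbeat can expose its internal stats over a local HTTP endpoint. A minimal sketch of the relevant metricbeat.yml settings (the values shown are the documented defaults):

# In metricbeat.yml (not the module file): expose internal stats locally.
# GET http://localhost:5066/stats returns a JSON snapshot that includes
# beat.memstats, which makes a suspected leak easy to track over time.
http.enabled: true
http.host: localhost
http.port: 5066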

At this point Metricbeat has been running for around a week, and its memory usage on the 2016 server is > 3GB:
[screenshot: Metricbeat memory usage on the 2016 server]

On the 2012 server it is significantly lower:
[screenshot: Metricbeat memory usage on the 2012 server]

Any clue as to why this is happening only on the 2016 server?
Area chart of the memory usage over the past week:
[area chart]



I too have the same issue with every 7.x Metricbeat I've tried. The only thing I've changed from the default .yml files is the Elasticsearch host. I did try changing the collection period from 10s to 60s, and that causes a much slower increase in RAM usage over time.
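For reference, that change is just the period on the first system module block, e.g.:

- module: system
  period: 60s            # was 10s; the growth slows but does not stop
  metricsets:
    - cpu
    - memory
    - network
    - process
    - process_summary
    - socket_summary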


I increased the period on the first system module block from 10s to 60s, and it does appear to slow the growth:
Top graph is the 2016 server, bottom graph is the 2012 server.

All of the drops are from restarting the Metricbeat service.

I have also observed this: Metricbeat 7.1.1 running on Windows Server 2016 Datacenter ramps up to > 3GB memory utilization.

I've installed Metricbeat 7.1.1 on a Windows Server 2016 Standard server (QA-MS2018) that has nothing running on it aside from serving as a Hyper-V host, and it appears to exhibit the same behavior: in under 24 hours it has grown to over 500MB.

I'll try on a Windows Server 2019 server too.

Server 2016 running Metricbeat 7.1.1, over the last 7 days. I have multiple hosts exhibiting the same problem. The drop is where I killed the service; I'm not sure it would have recovered by itself, but I can't just wait around for that to happen on that server.

Edit: this time series shows only the Metricbeat process memory utilization. The server itself reached critical levels of memory utilization, which alerted me to check and then end the process. It seems to just keep consuming memory, without releasing it, until there's none left.

Edit: this time series shows the WmiPrvSE.exe process using huge amounts of CPU on Windows Server 2008 R2 running Metricbeat 7.1.1. This is consistent across all 20 or so of these servers that I run, and it did not happen with Metricbeat 6.5.4 using the same config. The usage falls away entirely once the Metricbeat service is stopped. So it looks like Metricbeat 7.1.1 also has a problem on Windows Server 2008 R2, except that instead of consuming huge amounts of RAM, it thrashes the WMI provider process. I have not altered the collection interval or the monitored metricsets between Metricbeat versions. That's a LOT of CPU for a monitoring service to use all by itself; it's no good monitoring a server if the monitoring agent itself causes resource exhaustion.

Note: this graph shows ONLY the WmiPrvSE.exe process by itself.

[graph: WmiPrvSE.exe CPU usage on Server 2008 R2]

The graph below shows the total CPU over that same period.

[graph: total CPU over the same period]
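If you want a chart showing only those two processes, the system module's process metricset accepts a whitelist of name regexes; a sketch (the patterns here are assumptions, adjust as needed):

- module: system
  period: 10s
  metricsets:
    - process
  # 'processes' is a list of regexes matched against the process name;
  # only matching processes are reported.
  processes: ['metricbeat', 'WmiPrvSE']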

Here's a Windows Server 2012 R2 server that also shows high CPU utilization from Metricbeat 7.1.1. It likewise did not have this problem when running 6.5.4, and the config was not altered. The graph below shows those two processes stacked as a % of total CPU:
[graph: Metricbeat and WmiPrvSE.exe CPU, stacked, as % of total]

That same server as I stopped Metricbeat:

[graph: CPU usage as the Metricbeat service was stopped]

That same server after I uninstalled Metricbeat 7.1.1 and installed Metricbeat 6.5.4 using the exact same config:

[graph: CPU usage after downgrading to Metricbeat 6.5.4]

Preliminary results from running Metricbeat on a Windows Server 2019 VM and a Windows 10 machine (which also has an Elasticsearch node installed) for the last 24 hours show no noticeable increase in memory consumption.

This issue may just be isolated to Windows Server 2016.

I configured one of the Windows Server 2016 servers to send only the process metricset, and it appears to exhibit the same growth rate as when the default metricsets were enabled:

# Module: system
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/7.1/metricbeat-module-system.html

- module: system
  period: 10s
  metricsets:
    #- cpu
    #- load
    #- memory
    #- network
    - process
    #- process_summary
    #- socket_summary
    #- core
    #- diskio
    #- socket
  process.include_top_n:
    by_cpu: 20      # include top 20 processes by CPU
    by_memory: 20   # include top 20 processes by memory

- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
  - drop_event.when.regexp:
      system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'

- module: system
  period: 15m
  metricsets:
    - uptime

#- module: system
#  period: 5m
#  metricsets:
#    - raid
#  raid.mount_point: '/'
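Another way to watch the growth without a dashboard: Metricbeat periodically logs its own internal metrics, and the cadence is configurable in metricbeat.yml. A sketch using the documented logging options (values shown are the defaults):

# The "Non-zero metrics in the last 30s" log lines include memory counters
# such as beat.memstats.memory_alloc, so the leak is visible in the log itself.
logging.metrics.enabled: true
logging.metrics.period: 30s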

CPU usage on my systems:
[graph: Metricbeat CPU usage across hosts]

Scaled to the number of cores, it doesn't appear to be eating too much CPU time... though these are servers with a high number of CPU cores.
QA-MS2018 has 24 cores/48 threads.
QA-DM-HQS-2012 has 12 cores/24 threads.
DEV-AP-2016-DC has 8 cores/8 threads.

When I use system.process.cpu.total.pct instead of system.process.cpu.total.norm.pct it does appear to use quite a lot:
[graph: un-normalized CPU usage]
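For anyone comparing the two fields: the normalized value is just the raw value divided by the number of logical processors, so a large pct can still be a small norm.pct on a big host. Roughly (the 480% figure is only illustrative):

# system.process.cpu.total.norm.pct = system.process.cpu.total.pct / logical processors
# e.g. on QA-MS2018 (48 threads): 4.80 (480%) / 48 = 0.10 (10%)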

However, I did notice on my Windows 10 development machine that the CPU usage was somewhat high. It only has 4 cores/8 logical processors.


The initial portion was when I had the interval set to 10s. When I noticed the high CPU usage of WmiPrvSE.exe, I changed the interval to 60s and it seems to have lowered the usage slightly.

The memory leak in Metricbeat seems to have been fixed in Metricbeat 7.2.
MyServer on Metricbeat 7.1:
[graph: memory usage climbing]

MyServer on Metricbeat 7.2:
[graph: memory usage stable]

Thanks @wisdomgt, I'll give that a test in my environment in the next week and see what happens with Server 2012 R2 and Server 2008 R2 and their CPU usage. In the meantime I have rolled back to Metricbeat 6.5.4, since that was nice and reliable.

Edit: from the release notes, it looks like 7.2 fixed both problems, the CPU usage and the memory leak.

Update: Metricbeat 7.2.0 is running smoothly in my environment across Server 2008 R2, Server 2012 R2, Server 2016 and Server 2019.

Can confirm: Metricbeat 7.2.0 seems to be leak-free on my Windows Server 2016 system:
[graph: Metricbeat 7.2.0 memory usage holding steady]

CPU usage is also down from 7.1.1.
