Metricbeat WMI CPU usage


#1

We want to use Metricbeat to monitor processes on our Windows servers.
The problem is that we have 200+ processes, which seems to cause performance issues:

  1. WmiPrvSE.exe consumes 90%+ of one CPU core.
  2. Metricbeat freezes completely for the first couple of minutes after the processes are started.
  3. At least with Topbeat 1.3.1, the log is full of "WARN Ignoring tick(s) due to processing taking longer than one period" (we have period: 10s). We are not sure whether Metricbeat 5.2.1 still has this problem.

We see that the high CPU usage is caused by obtaining the processes' command lines via WMI (as discussed in "Windows Topbeat Causes High CPU Utilization in WMI Provider Host [Solved]").

It would be great if Metricbeat let the user specify that system.process.cmdline is not needed, so that the dramatic WMI performance overhead can be avoided.

Note this does not have the desired effect:

processors:
  - drop_fields:
      fields: [ system.process.cmdline ]

Also note that the problem may only manifest itself when processes have fairly long command lines (which our PROD processes do). For that reason, the spawn_dummy_processes.bat batch file given below includes a quick-and-dirty way to spawn processes with long command lines.

As a proof of concept, we've patched Metricbeat to not query WMI. Here are the results:

[Figure: original version]

[Figure: patched version]

spawn_dummy_processes.bat to use for testing

@echo off
set childrenLifeDuration=180
set childrenCount=300
call :generateRandomString 300
echo Spawning %childrenCount% processes that will live for %childrenLifeDuration% seconds...
FOR /L %%X IN (1,1,%childrenCount%) DO call :spawn %%X
echo Press any key to terminate spawned processes prematurely...
pause > NUL
echo Killing processes...
FOR /L %%X IN (1,1,%childrenCount%) DO waitfor /SI XXX%%X > NUL 2>&1
echo Done.
GOTO :EOF

:spawn
::call :generateRandomString 400
:: add a randomized prefix, so that processes have different command lines
set longString=%random% %longString%
start /MIN cmd "/c echo %longString%& waitfor /T %childrenLifeDuration% XXX%1"
::waitfor /T 1 ZZZZ > NUL 2>&1
GOTO :EOF

:generateRandomString
call :generateRandomStringImpl "longString" %1
set longString=%accumString%
set accumString=
GOTO :EOF

:generateRandomStringImpl
set accumString=%~1
set remainingSteps=%2
IF 0==%remainingSteps% GOTO :EOF
set /A remainingSteps = %remainingSteps% - 1
set accumString=%accumString% %random% %random% %random%
call :generateRandomStringImpl "%accumString%" %remainingSteps%
GOTO :EOF


(Andrew Kroh) #2

The drop_fields processor is applied after the event is created, so it won't save any CPU cycles. You could limit which processes are collected with the processes: ["some regex"] option.
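To illustrate why dropping the field afterwards cannot help, here is a minimal Go sketch of that ordering. It is not Metricbeat code: fetchProcess, dropFields and runPipeline are hypothetical names, and the counter merely stands in for the costly WMI lookup.

```go
package main

import "fmt"

// event is a simplified stand-in for a published event.
type event map[string]string

// fetchProcess stands in for the collection step (hypothetical). The counter
// records how often the costly command-line lookup ran.
func fetchProcess(pid int, lookups *int) event {
	*lookups++ // the expensive work happens here, before any processor runs
	return event{
		"system.process.pid":     fmt.Sprint(pid),
		"system.process.cmdline": "cmd.exe /c echo hello",
	}
}

// dropFields mimics a drop_fields-style processor: it only edits an event
// that has already been built.
func dropFields(e event, fields ...string) event {
	for _, f := range fields {
		delete(e, f)
	}
	return e
}

// runPipeline builds one event per PID, then applies the processor.
// It returns the number of lookups performed and the resulting events.
func runPipeline(pids []int) (int, []event) {
	lookups := 0
	events := make([]event, 0, len(pids))
	for _, pid := range pids {
		e := fetchProcess(pid, &lookups)
		events = append(events, dropFields(e, "system.process.cmdline"))
	}
	return lookups, events
}

func main() {
	lookups, events := runPipeline([]int{101, 102, 103})
	fmt.Println("lookups performed:", lookups)
	fmt.Println("first event after drop_fields:", events[0])
}
```

The field is gone from every published event, yet one lookup per process was still paid for.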

Metricbeat should be using a cached copy of the cmdline after the first fetch for a given PID. Are you seeing this reflected in your testing? But that won't help with the initial fetch.
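The caching idea can be sketched roughly as follows in Go. This is an illustration under assumptions, not the actual Metricbeat implementation: cmdlineCache and its fetch callback are hypothetical, and a real cache would also have to evict entries for exited processes and cope with PID reuse.

```go
package main

import "fmt"

// cmdlineCache remembers the command line already fetched for each PID.
// Hypothetical type for illustration only.
type cmdlineCache struct {
	entries map[int]string
	fetch   func(pid int) (string, error) // expensive lookup, e.g. a WMI query
}

func newCmdlineCache(fetch func(int) (string, error)) *cmdlineCache {
	return &cmdlineCache{entries: make(map[int]string), fetch: fetch}
}

// Get returns the cached command line for pid and calls fetch only on a miss.
func (c *cmdlineCache) Get(pid int) (string, error) {
	if s, ok := c.entries[pid]; ok {
		return s, nil
	}
	s, err := c.fetch(pid)
	if err != nil {
		return "", err
	}
	c.entries[pid] = s
	return s, nil
}

func main() {
	calls := 0
	cache := newCmdlineCache(func(pid int) (string, error) {
		calls++
		return fmt.Sprintf("cmd.exe /c job-%d", pid), nil
	})
	// Three collection periods for the same PID: only the first one is costly.
	for i := 0; i < 3; i++ {
		if _, err := cache.Get(42); err != nil {
			panic(err)
		}
	}
	fmt.Println("expensive fetches:", calls)
}
```

With such a cache the cost is concentrated in the first collection after a process appears, which is why it does not help with the initial fetch.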

Personally I would be ok with adding a processes.cmdline: true/false option to toggle the cmdline collection. Probably a better solution would be to find a way that doesn't require WMI to get the cmdline.


#3

Yep, it is understandable that drop_fields only runs after the event has been created.

Limiting the collected processes with a regex does not solve our problem, since we do need CPU usage info on each of those 200+ processes :frowning:

As for the cached cmdline, I just tried it: indeed, CPU usage is back to normal within 3-4 minutes. :+1: In fact, this seems to be good enough for us to use. I wonder why we saw so many "WARN Ignoring tick(s) due to processing taking longer than one period" messages with Topbeat...

[quote="andrewkroh, post:2, topic:69553"]
Probably a better solution would be to find a way that doesn't require WMI to get the cmdline.
[/quote]
Totally agree.


(Andrew Kroh) #4

That Topbeat message was logged any time the collection took longer than 10 seconds (or whatever the collection period was). I don't think there is an equivalent message in Metricbeat, but the same thing probably occurs on your first collection interval. Each MetricSet does its collection on its own goroutine, so it cannot block the collection of metrics from other modules. And if the collection takes longer than 10 seconds, it will just resume collection at the fixed interval.
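The behaviour described above can be sketched in Go like this. It is a simplified model, not Metricbeat's scheduler: runPeriodically, runAll and demoCounts are hypothetical, the periods are shortened to milliseconds, and using time.Ticker is an assumption. A Ticker drops ticks for a slow receiver, so an overlong collection resumes on a later tick rather than piling up work.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runPeriodically calls collect once per tick until stop is closed.
// Ticks that fire while collect is still running are dropped by the Ticker.
func runPeriodically(period time.Duration, collect func(), stop <-chan struct{}) {
	t := time.NewTicker(period)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			collect()
		}
	}
}

// runAll starts one goroutine per collector, lets them run for total,
// and returns how many times each collector completed.
func runAll(period, total time.Duration, collectors []func()) []int64 {
	counts := make([]int64, len(collectors))
	stop := make(chan struct{})
	var wg sync.WaitGroup
	for i, c := range collectors {
		wg.Add(1)
		go func(i int, c func()) {
			defer wg.Done()
			runPeriodically(period, func() {
				c()
				atomic.AddInt64(&counts[i], 1)
			}, stop)
		}(i, c)
	}
	time.Sleep(total)
	close(stop)
	wg.Wait()
	return counts
}

// demoCounts runs a fast and a deliberately slow collector side by side.
func demoCounts() []int64 {
	fast := func() {}
	slow := func() { time.Sleep(35 * time.Millisecond) } // longer than three periods
	return runAll(10*time.Millisecond, 300*time.Millisecond, []func(){fast, slow})
}

func main() {
	counts := demoCounts()
	fmt.Printf("fast collector ran %d times, slow collector ran %d times\n", counts[0], counts[1])
}
```

The slow collector completes fewer runs, while the fast one keeps its own schedule because the two never share a goroutine.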


#5

Can we add a feature request for this? While CPU usage does indeed decrease 3-4 minutes after the processes are started, there can still be situations (pretty rare, I admit) when we don't have those 3-4 minutes. So, if it's not a big change and there is a simple and clean way to add such a config option, we'd like to have it.


(Andrew Kroh) #6

Yes, please open a request for it. It probably would be relatively easy to add such an option because the ProcArgs are requested on their own from the sigar library. So it would just mean adding a conditional around that call.
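A conditional of that kind might look roughly like the Go sketch below. Everything here is hypothetical: processConfig.Cmdline mirrors the processes.cmdline toggle proposed earlier in the thread, and argsFetcher is a stand-in for the sigar ProcArgs lookup rather than its real signature.

```go
package main

import (
	"fmt"
	"strings"
)

// processConfig is a hypothetical config struct; Cmdline mirrors the
// proposed "processes.cmdline: true/false" option.
type processConfig struct {
	Cmdline bool
}

// argsFetcher stands in for the expensive per-PID argument lookup.
type argsFetcher func(pid int) ([]string, error)

// buildEvent assembles the fields for one process and only performs the
// argument lookup when the option is enabled.
func buildEvent(cfg processConfig, pid int, name string, fetchArgs argsFetcher) (map[string]interface{}, error) {
	event := map[string]interface{}{"pid": pid, "name": name}
	if cfg.Cmdline {
		args, err := fetchArgs(pid)
		if err != nil {
			return nil, err
		}
		event["cmdline"] = strings.Join(args, " ")
	}
	return event, nil
}

func main() {
	lookups := 0
	fetch := func(pid int) ([]string, error) {
		lookups++
		return []string{"cmd.exe", "/c", "echo", "hello"}, nil
	}
	for _, enabled := range []bool{false, true} {
		ev, err := buildEvent(processConfig{Cmdline: enabled}, 4242, "cmd.exe", fetch)
		if err != nil {
			panic(err)
		}
		fmt.Printf("cmdline=%v event=%v lookups so far=%d\n", enabled, ev, lookups)
	}
}
```

Unlike drop_fields, the guard sits in front of the lookup, so disabling the option skips the costly call entirely.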


#7

Great! And thank you so much for the prompt, friendly and professional responses! :slight_smile:


#8

JFYI here's the feature request: https://github.com/elastic/beats/issues/3249

