We are in the process of trying to deploy Topbeat, Filebeat, and WinLogBeat in our Windows-based web hosting environment. When the Topbeat service is running, one core of our servers is completely consumed by the WMI Provider Host (WmiPrvSE.exe). I feel fairly confident saying that Topbeat is the culprit causing this activity in the WMI process because the high CPU utilization is highly correlated with starting the Topbeat service. When the service is stopped the box returns to its usual, low level of CPU activity.
The performance impact is more pronounced in our Server 2008R2 environments compared to our Server 2012R2 environment, but I see the effect on both versions of the OS.
These are the things I've attempted to reduce the performance impact of the Topbeat agent:
Increase the polling interval - I set the "period" in the config file to values as high as 5 minutes (300 seconds) with not much impact. I also expected the high CPU to occur periodically on the same interval as the collection period, but this was not the case.
Disabled file system stats - This data isn't terribly important to us so I tried setting "filesystem" to false, but this had no effect.
Disabled individual process stats - We are VERY interested in this data, but wanted to see if disabling it had any effect on the CPU impact of the service.
Disabled the SCOM health agent - Thinking there might be some contention between the two monitoring services, I tried disabling the SCOM health agent. This didn't change the CPU activity.
Do any other members of the community have experience deploying Topbeat on Windows servers? What I'm trying to determine is whether this high WMI CPU utilization is related to a quirk of our environmental configuration or a conflict with another service running on these boxes.
I'm running a private build from the git repo circa Jan 27. I'm lousy with git so I'm not sure if there's a better way to identify the version of the code we used. I just did a "git log" on the local repo I used for the build and the last commit was from Jan 26.
We are using a private build because we need the process username which has been added to the output since the most recent release.
I also have to make a correction to my previous statements: the high CPU is only periodic when I set the polling interval to a long time (like 5 mins). I wasn't seeing the expected behavior because I failed to save the config file before restarting the service.
Other troubleshooting ideas I had:
Spin up a vanilla VM in Azure to eliminate any issues associated with our image or tools installed in our environment.
Compare with a released build, or a build from the latest source on GitHub.
We can also deal with the issue by adding cores to the servers, but it feels "wrong" that the monitoring software is using more resources than the actual workload of the server.
I've done some more testing, but I'm not quite sure what to make of the results.
On one of our web servers I tested the public 1.1 release and it didn't trigger high CPU in the WMI Host.
I provisioned a vanilla Windows 2008R2 VM in Azure and tested both our private build and the 1.1 release. Initially, neither build resulted in high CPU on the system.
However, one difference between this host and our web server environments is that our web servers have many processes running on them. As a test I spun up 100 PowerShell sessions. With the high process count, both the 1.1 release of Topbeat and our private build caused high CPU in the WMI provider.
I'm not sure what conclusion to draw from these results. To allow me to continue with the deployment I will throw a couple more cores at these boxes so we don't start getting CPU alerts, but I hope to learn more about how Topbeat affects the performance of our environments and whether anything can be done to reduce the impact.
Would love to hear from other folks if they have similar experiences or not.
Thanks for doing the investigation! A few quick thoughts:
The 1.1.1 release does not use WMI. So I wouldn't expect it to cause high CPU usage in the WMI process.
The master branch uses WMI to get the command line arguments of each process every time it reports the process information. So with a period of 10 seconds and 100 processes, that's 100 WMI queries every 10 seconds, or 600 per minute.
If you have lots of processes then this will make the issue worse.
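As a rough sanity check on the query load, here is a small Go sketch. It assumes exactly one WMI query per process per reporting period, which is my reading of the behavior and not something measured; the function name queriesPerMinute is made up for illustration.

```go
package main

import "fmt"

// queriesPerMinute estimates how many WMI queries are issued per minute,
// ASSUMING one query per process per reporting period. The real number
// may differ; this is only back-of-the-envelope arithmetic.
func queriesPerMinute(processes int, periodSeconds float64) float64 {
	return float64(processes) * 60.0 / periodSeconds
}

func main() {
	// 100 processes, 10 second period.
	fmt.Println(queriesPerMinute(100, 10))
	// Same process count with the 5 minute (300 second) period tried above.
	fmt.Println(queriesPerMinute(100, 300))
}
```

Under that assumption a longer period lowers the average rate but each cycle still issues one burst of queries per process, which would fit the periodic CPU spikes seen with the 5 minute interval.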
I think Topbeat should cache the command line arguments for a given process so that it doesn't need to make so many queries. Would you mind opening a new issue for this? I think it's a bug. Hopefully we can get a fix and you can retest prior to the next release.
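To make the caching idea concrete, here is a minimal sketch in Go. It is not Topbeat's actual code: the types procKey and cmdLineCache and the lookup callback are hypothetical names, and the lookup stands in for whatever expensive call (the WMI query in this case) fetches the command line. The key includes the process start time on the assumption that PIDs get reused and the command line of a running process does not change.

```go
package main

import (
	"fmt"
	"sync"
)

// procKey identifies one process instance. The start time is part of the
// key because the OS can reuse a PID for a different process later.
type procKey struct {
	PID       int
	StartTime uint64
}

// cmdLineCache remembers command lines so the expensive lookup runs once
// per process instead of once per reporting period.
type cmdLineCache struct {
	mu      sync.Mutex
	entries map[procKey]string
	lookup  func(pid int) (string, error) // hypothetical expensive call
}

func newCmdLineCache(lookup func(pid int) (string, error)) *cmdLineCache {
	return &cmdLineCache{entries: make(map[procKey]string), lookup: lookup}
}

// Get returns the cached command line, calling lookup only on a miss.
func (c *cmdLineCache) Get(k procKey) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if s, ok := c.entries[k]; ok {
		return s, nil
	}
	s, err := c.lookup(k.PID)
	if err != nil {
		return "", err
	}
	c.entries[k] = s
	return s, nil
}

// Prune drops entries for processes that were not seen in the latest
// cycle, so the cache does not grow without bound.
func (c *cmdLineCache) Prune(alive map[procKey]bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for k := range c.entries {
		if !alive[k] {
			delete(c.entries, k)
		}
	}
}

func main() {
	calls := 0
	cache := newCmdLineCache(func(pid int) (string, error) {
		calls++ // counts how often the "expensive" lookup really runs
		return fmt.Sprintf("fake.exe --pid %d", pid), nil
	})
	// Three reporting cycles over the same 100 processes.
	for cycle := 0; cycle < 3; cycle++ {
		for pid := 1; pid <= 100; pid++ {
			if _, err := cache.Get(procKey{PID: pid, StartTime: 1}); err != nil {
				panic(err)
			}
		}
	}
	fmt.Println("lookups:", calls)
}
```

With a stable set of processes the lookup count stays at the number of distinct processes no matter how many cycles run; only newly started processes cost a query.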
Deployed the latest nightly build and the WMI CPU utilization is much lower with the caching in place. Thanks a ton for the fast resolution of this issue!