Metricbeat on one of my Windows servers fails to start

I am running MetricBeat on about 30 Windows servers, and on one of them it's having trouble. Running it manually seems to work fine, although I get a whole bunch of "Skip process pid" messages due to missing pids, but I think that's fine.

If I start it as a service however, it appears to start correctly, start to read pids, and then just silently terminates. In the Service Manager, it seems to take ages to start, at which point the SM gives up with a timeout error.

I found a recent post similar to this, but it didn't have much info and seems to have been abandoned.

This server is one of our busier ones, with CPU averaging between 70% and 80%, but I don't see why that would cause the service to not start.

This is the debug log output. The "error getting process details" ones repeat about 60 times.

2016-11-16T11:24:13Z DBG  Disable stderr logging
2016-11-16T11:24:13Z INFO Home path: [C:\Beats\metricbeat] Config path: [C:\Beats\metricbeat] Data path: [C:\ProgramData\metricbeat] Logs path: [C:\Beats\metricbeat\logs]
2016-11-16T11:24:13Z INFO Setup Beat: metricbeat; Version: 5.0.0-rc1
2016-11-16T11:24:13Z DBG  Processors: 
2016-11-16T11:24:13Z DBG  Initializing output plugins
2016-11-16T11:24:13Z INFO Elasticsearch url: http://logstash:9200
2016-11-16T11:24:13Z INFO Elasticsearch url: http://elastic1:9200
2016-11-16T11:24:13Z DBG  configure maxattempts: 4
2016-11-16T11:24:13Z INFO Metrics logging every 30s
2016-11-16T11:24:13Z DBG  load balancer: start client loop
2016-11-16T11:24:13Z DBG  ES Ping(url=http://logstash:9200, timeout=1m30s)
2016-11-16T11:24:13Z DBG  load balancer: start client loop
2016-11-16T11:24:13Z DBG  ES Ping(url=http://elastic1:9200, timeout=1m30s)
2016-11-16T11:24:13Z INFO Activated elasticsearch as output plugin.
2016-11-16T11:24:13Z DBG  Create output worker
2016-11-16T11:24:13Z DBG  No output is defined to store the topology. The server fields might not be filled.
2016-11-16T11:24:13Z INFO Publisher name: ca-app-rc01
2016-11-16T11:24:14Z INFO Flush Interval set to: 1s
2016-11-16T11:24:14Z INFO Max Bulk Size set to: 50
2016-11-16T11:24:14Z DBG  create bulk processing worker (interval=1s, bulk size=50)
2016-11-16T11:24:14Z INFO Register [ModuleFactory:[system], MetricSetFactory:[apache/status, haproxy/info, haproxy/stat, mongodb/status, mysql/status, nginx/stubstatus, postgresql/activity, postgresql/bgwriter, postgresql/database, redis/info, redis/keyspace, system/cpu, system/diskio, system/filesystem, system/fsstat, system/memory, system/network, system/process, zookeeper/mntr]]
2016-11-16T11:24:14Z DBG  Skip process pid=0: error getting process state for pid=0: OpenProcess fails with The parameter is incorrect.
2016-11-16T11:24:14Z DBG  Skip process pid=4: error getting process state for pid=4: OpenProcess fails with Access is denied.
2016-11-16T11:24:14Z DBG  Skip process pid=560: error getting process state for pid=560: OpenProcess fails with Access is denied.
2016-11-16T11:24:14Z DBG  Ping status code: 200
2016-11-16T11:24:14Z INFO Connected to Elasticsearch version 2.4.1
2016-11-16T11:24:14Z DBG  Ping status code: 200
2016-11-16T11:24:14Z INFO Connected to Elasticsearch version 2.4.1
2016-11-16T11:24:15Z DBG  Skip process pid=656: error getting process state for pid=656: OpenProcess fails with Access is denied.
2016-11-16T11:24:29Z DBG  Skip process pid=4292: error getting process state for pid=4292: OpenProcess fails with Access is denied.
2016-11-16T11:24:33Z ERR Error getting process details pid=4880: error getting process arguments for pid=4880: could not get CommandLine: could not get Win32_Process WHERE ProcessId = 4880: Process not found
2016-11-16T11:24:34Z ERR Error getting process details pid=1392: error getting process arguments for pid=1392: could not get CommandLine: could not get Win32_Process WHERE ProcessId = 1392: Process not found
2016-11-16T11:24:34Z ERR Error getting process details pid=1948: error getting process arguments for pid=1948: could not get CommandLine: could not get Win32_Process WHERE ProcessId = 1948: Process not found

I'm recalling that earlier post on this issue. I recent made some improvements to the Windows code that might have some effect on this problem, but it wasn't a change to specifically address the timeout problem.

Could you please test the snapshot build of Metricbeat that has my changes (these will be released in 5.1)? And also post the output like you did previously.

If that version is still timing out then we can look at ways of speeding up the initialization to prevent the timeout of the Windows service.

Pretty much the same thing. It's interesting that I see the "ERR" lines in the log while the Service Manager is still showing the "starting" dialog. It sort of implies that MB is cheerfully trying to get process stats before telling Windows that it has actually started. A Windows Service should return the correct signals to the Service Manager before starting any actual processing.

2016-11-16T15:19:45Z INFO Home path: [C:\Beats\metricbeat] Config path: [C:\Beats\metricbeat] Data path: [C:\ProgramData\metricbeat] Logs path: [C:\Beats\metricbeat\logs]
2016-11-16T15:19:45Z INFO Setup Beat: metricbeat; Version: 6.0.0-alpha1
2016-11-16T15:19:45Z INFO Elasticsearch url: http://logstash:9200
2016-11-16T15:19:45Z INFO Elasticsearch url: http://elastic1:9200
2016-11-16T15:19:45Z INFO Activated elasticsearch as output plugin.
2016-11-16T15:19:45Z INFO Publisher name: ca-app-rc01
2016-11-16T15:19:45Z INFO Flush Interval set to: 1s
2016-11-16T15:19:45Z INFO Max Bulk Size set to: 50
2016-11-16T15:19:45Z INFO Register [ModuleFactory:[system], MetricSetFactory:[apache/status, beats/filebeat, beats/libbeat, docker/container, docker/cpu, docker/diskio, docker/info, docker/memory, docker/network, haproxy/info, haproxy/stat, mongodb/status, mysql/status, nginx/stubstatus, postgresql/activity, postgresql/bgwriter, postgresql/database, redis/info, redis/keyspace, system/core, system/cpu, system/diskio, system/filesystem, system/fsstat, system/memory, system/network, system/process, zookeeper/mntr]]
2016-11-16T15:19:45Z INFO Metricbeat process and system info: {"OSVersion":{"Major":6,"Minor":2,"Build":9200},"Arch":"amd64","NumCPU":4,"User":{"SID":"S-1-5-18","Account":"SYSTEM","Domain":"NT AUTHORITY","Type":1},"ProcessPrivs":{"SeAssignPrimaryTokenPrivilege":{"enabled":false},"SeAuditPrivilege":{"enabled_by_default":true,"enabled":true},"SeBackupPrivilege":{"enabled":false},"SeChangeNotifyPrivilege":{"enabled_by_default":true,"enabled":true},"SeCreateGlobalPrivilege":{"enabled_by_default":true,"enabled":true},"SeCreatePagefilePrivilege":{"enabled_by_default":true,"enabled":true},"SeCreatePermanentPrivilege":{"enabled_by_default":true,"enabled":true},"SeCreateSymbolicLinkPrivilege":{"enabled_by_default":true,"enabled":true},"SeDebugPrivilege":{"enabled_by_default":true,"enabled":true},"SeImpersonatePrivilege":{"enabled_by_default":true,"enabled":true},"SeIncreaseBasePriorityPrivilege":{"enabled_by_default":true,"enabled":true},"SeIncreaseQuotaPrivilege":{"enabled":false},"SeIncreaseWorkingSetPrivilege":{"enabled_by_default":true,"enabled":true},"SeLoadDriverPrivilege":{"enabled":false},"SeLockMemoryPrivilege":{"enabled_by_default":true,"enabled":true},"SeManageVolumePrivilege":{"enabled":false},"SeProfileSingleProcessPrivilege":{"enabled_by_default":true,"enabled":true},"SeRestorePrivilege":{"enabled":false},"SeSecurityPrivilege":{"enabled":false},"SeShutdownPrivilege":{"enabled":false},"SeSystemEnvironmentPrivilege":{"enabled":false},"SeSystemProfilePrivilege":{"enabled_by_default":true,"enabled":true},"SeSystemtimePrivilege":{"enabled":false},"SeTakeOwnershipPrivilege":{"enabled":false},"SeTcbPrivilege":{"enabled_by_default":true,"enabled":true},"SeTimeZonePrivilege":{"enabled_by_default":true,"enabled":true},"SeUndockPrivilege":{"enabled":false}}}
2016-11-16T15:19:45Z INFO SeDebugPrivilege is enabled. SeDebugPrivilege=(Default, Enabled)
2016-11-16T15:19:45Z INFO Connected to Elasticsearch version 2.4.1
2016-11-16T15:19:45Z INFO Connected to Elasticsearch version 2.4.1
2016-11-16T15:19:57Z ERR Error getting process details pid=11368: error getting process arguments for pid=11368: ProcArgs failed for pid=11368: could not get Win32_Process WHERE ProcessId = 11368: Process not found
2016-11-16T15:19:57Z ERR Error getting process details pid=6748: error getting process arguments for pid=6748: ProcArgs failed for pid=6748: could not get Win32_Process WHERE ProcessId = 6748: Process not found
2016-11-16T15:19:58Z ERR Error getting process details pid=6616: error getting process arguments for pid=6616: ProcArgs failed for pid=6616: could not get Win32_Process WHERE ProcessId = 6616: Process not found
2016-11-16T15:19:58Z ERR Error getting process details pid=12664: error getting process arguments for pid=12664: ProcArgs failed for pid=12664: could not get Win32_Process WHERE ProcessId = 12664: Process not found
2016-11-16T15:19:58Z ERR Error getting process details pid=8072: error getting process arguments for pid=8072: ProcArgs failed for pid=8072: could not get Win32_Process WHERE ProcessId = 8072: Process not found

Forgot to add, that it gets to that first ERR line REALLY quickly, so it does look like it's not sending a service signal correctly for some reason. Interestingly, on the other ~20 servers I have, it's not a problem - just on two.
They are all Windows Server 2012 R2 by the way. Pretty much identical builds.

Can you please open a new issue for this in the beats repo?

I'm pretty sure the problem is this line which requests all process stats in the module initialization phase. The New method should run quickly, but it's probably blocking other initialization (namely the goroutine that responds to the service manager signals).

Sure, no problem.

Done - Issue 3018.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.