Errors in metric beat log

Hi,

I'm running Metricbeat 6.2.2-1 on RHEL 7.5 with the following kernel version.

# uname -r -v
3.10.0-862.11.6.el7.x86_64 #1 SMP Fri Aug 10 16:55:11 UTC 2018

I'm seeing the below errors in metricbeat log.

2018-09-07T22:28:01.829-0600 ERROR process_summary/process_summary.go:79 Unknown state <88> for process with pid 20913
2018-09-08T06:08:01.933-0600 ERROR process/process.go:454 Error getting process details. pid=10808: error getting process mem for pid=10808: no such process
2018-09-08T09:53:31.785-0600 ERROR process_summary/process_summary.go:79 Unknown state <88> for process with pid 20679
2018-09-08T16:48:41.857-0600 ERROR process/process.go:454 Error getting process details. pid=2508: error getting process file descriptor usage for pid=2508: open /proc/2508/fd: no such file or directory
2018-09-09T02:34:01.949-0600 ERROR process/process.go:454 Error getting process details. pid=32629: error getting process arguments for pid=32629: read /proc/32629/cmdline: no such process

Can someone tell me why I get these errors ?

Hi @Ihjaz :slight_smile:

Can you share your metricbeat configuration? I feel like you are trying to monitor something that is constantly dying.

Metricbeat look for information in /proc to gather metrics. Also ensure that it's not a permissions issue and that you're running metricbeat with sudo

Hi Mario,

I've enabled the system module with the following content. All other modules are disabled.

- module: system
  metricsets:
    - cpu             # CPU usage
    - filesystem      # File system usage for each mountpoint
    - fsstat          # File system summary metrics
    - load            # CPU load averages
    - memory          # Memory usage
    - network         # Network IO
    - process         # Per process metrics
    - process_summary # Process summary
    - uptime          # System Uptime
    #- core           # Per CPU core usage
    - diskio         # Disk IO
    #- socket         # Sockets and connection info (linux only)
  enabled: true
  period: 10s
  processes: ['.*']

  # Configure the metric types that are included by these metricsets.
  cpu.metrics:  ["percentages"]  # The other available options are normalized_percentages and ticks.
  core.metrics: ["percentages"]  # The other available option is ticks.

Also metricbeat service on the server is started by the root user.

# ps -ef | grep metricbeat
root     28469     1  1 Sep07 ?        01:02:56 /usr/share/metricbeat/bin/metricbeat -c /etc/metricbeat/metricbeat.yml -path.home /usr/share/metricbeat -path.config /etc/metricbeat -path.data /var/lib/metricbeat

I'm not seeing any error today but on 6th and 7th Sept I was seeing the 'Unknown state <88> error' constantly as shown below.

2018-09-06T16:09:07.153-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 17179
2018-09-06T17:17:07.128-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 1130
2018-09-06T17:27:07.132-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 5740
2018-09-06T17:47:07.135-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 14727
2018-09-06T18:07:07.142-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 24452
2018-09-06T18:08:07.148-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 26307
2018-09-06T18:48:07.131-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 12059
2018-09-06T18:57:07.137-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 14872
2018-09-06T18:59:07.142-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 28334
2018-09-06T19:05:07.137-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 16320
2018-09-06T19:41:07.140-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 17912
2018-09-06T19:48:07.147-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 7750
2018-09-06T19:52:07.144-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 24827
2018-09-06T20:03:07.145-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 30794
2018-09-06T20:29:07.134-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 5402
2018-09-06T20:51:07.138-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 17802
2018-09-06T21:06:07.136-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 9009
2018-09-06T22:23:07.151-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 31005
2018-09-06T23:01:07.132-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 13664
2018-09-06T23:22:07.143-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 25080
2018-09-06T23:22:07.143-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 25081
2018-09-07T00:10:07.133-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 12164
2018-09-07T00:17:07.267-0600    ERROR   process/process.go:454  Error getting process details. pid=1944: error getting process file descriptor usage for pid=1944: read /proc/1944/limits: no such process
2018-09-07T01:12:07.135-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 11312
2018-09-07T01:12:07.135-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 11313
2018-09-07T01:39:07.137-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 19529
2018-09-07T02:19:07.131-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 6076
2018-09-07T02:30:07.142-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 12342
2018-09-07T02:31:07.133-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 13946
2018-09-07T02:46:07.136-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 5073
2018-09-07T03:28:47.288-0600    ERROR   process/process.go:454  Error getting process details. pid=5532: error getting process file descriptor usage for pid=5532: open /proc/5532/fd: no such file or directory
2018-09-07T03:30:07.134-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 7928
2018-09-07T03:37:07.143-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 30216
2018-09-07T04:16:07.136-0600    ERROR   process_summary/process_summary.go:79   Unknown state <88> for process with pid 14694

I'm not sure about the root of the problem but as far as I can see Unknown state <88> for process with pid means that the mentioned pid was dead.

state 88 is ASCII state of a PID process which in this case is X state = dead as you can see, for example, in this table http://thelinuxstuff.blogspot.com/2012/08/process-state-codes-in-ps-output.html

Just trying to progress over your issue :slight_smile:

Hi,

So if the process is dead, we should not be seeing their ids in the /proc folder right.

Is metricbeat caching the list of previously read process ids?

Hi,

Does anyone here know why metricbeat would try to get the summary of a process which is already dead?

Sorry for the late reply.

We haven't seen yet that metricbeat tries to fetch information about dead processes. Maybe you can share your metricbeat configuration and we can try to take a look.

Hi Mario,

As I mentioned above I only have the system module enabled. I have shared the content of the module above.
Is there any other configuration you are looking for?

Hi @Ihjaz

Sorry, I completely overlooked at the config. So as far as I could read, it is possible that you can see dead processes but it's very unlikely.

Even if metricbeat is "caching" the PID, as soon as it tries to get the state from the OS, it won't find the file in /proc/[pid]/status so it won't get a 88 status either. This is easy to check, just execute in a shell:

sleep 30 &
[1] 9999
cat /proc/9999/status
...
State:	S (sleeping)
...

Wait until the process has finished after 30 seconds and repeat the cat program, you'll naturally get a cat: /proc/9999/status: No such file or directory.

There is not special magic within metricbeat to cache PID process between calls. It must be somethiing with the way that RHEL 7.5 deals with removing processes information from proc folder.

So the only next step I can think about now is to lower or raise the check period, just to check if the error messages are reduced. We are also considering to simply not log this message because it provides little valuable information anyways.

I hope it helps :slight_smile:

Thanks Mario. This was really helpful.

Hi @Ihjaz :slight_smile:

We have just merged and enhancement that will remove the error messages and, instead, count them as the value "dead" within the metricset event. Also, we will print the returned character instead of the ASCII code.

Please, report if you see any issue https://github.com/elastic/beats/pull/8275

Thanks!

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.