Topbeat sometimes returns an invalid value for cpu.system_p on Windows


(Marco Ponton) #1

Hi,

I've been running Topbeats on servers for a PoC at a client for about 3 weeks. Every now and then, Topbeat returns an invalid cpu.system_p value. This of course does not look very good on a graph... So far I've seen it happen only 3 times on a single host (out of 3 hosts).

Here's the last example:

          "cpu": {
            "user": 18150,
            "user_p": 0,
            "nice": 0,
            "system": 20409,
            "system_p": 970881267037344770,
            "idle": 4365624,
            "iowait": 0,
            "irq": 0,
            "softirq": 0,
            "steal": 0
          },

I am unsure if an issue with topbeat.go calculations, sigar_windows.go, the system itself returning invalid values or cosmic rays :slight_smile: ...

Some info:

Beat version: 1.1.0
Operating System: Windows 2012 R2 with up-to-date patches
Configuration:

input:
  period: 10

  procs: [".*"]

  stats:
    system: true
    proc: true
    filesystem: true
    cpu_per_core: true

Beats sent directly to ElasticSearch, not to LogStash.

Any ideas/suggestions to investigate further?

Thanks and regards,

Marco


(Andrew Kroh) #2

Hi @MarcoPonton, can you post the immediately preceding cpu stat document to the one you already posted? This way we can look at the delta between the two and see if we can try and figure out where the calculation went wrong.


(Marco Ponton) #3

Hi Andrew,

Here are the 3 records (the invalid one, and the ones before and after it):

    "hits": [
      {
        "_index": "topbeat-2016.03.07",
        "_type": "system",
        "_id": "AVNRK6pt8UEWGpsSjr2O",
        "_score": 3.6366284,
        "_source": {
          "@timestamp": "2016-03-07T13:02:08.907Z",
          "beat": {
            "hostname": "_REMOVED_",
            "name": "_REMOVED_"
          },
          "count": 1,
          "cpu": {
            "user": 18150,
            "user_p": 0,
            "nice": 0,
            "system": 20410,
            "system_p": 0.05,
            "idle": 4365604,
            "iowait": 0,
            "irq": 0,
            "softirq": 0,
            "steal": 0
          },
          "cpus": {},
          "load": {
            "load1": 0,
            "load5": 0,
            "load15": 0
          },
          "mem": {
            "total": 4294496256,
            "used": 1564626944,
            "free": 2729869312,
            "used_p": 0.36,
            "actual_used": 895258624,
            "actual_free": 1252093952,
            "actual_used_p": 0.21
          },
          "swap": {
            "total": 0,
            "used": 0,
            "free": 0,
            "used_p": 0
          },
          "type": "system"
        }
      },
      {
        "_index": "topbeat-2016.03.07",
        "_type": "system",
        "_id": "AVNRK9UW8UEWGpsSjsAa",
        "_score": 3.6360867,
        "_source": {
          "@timestamp": "2016-03-07T13:02:18.907Z",
          "beat": {
            "hostname": "_REMOVED_",
            "name": "_REMOVED_"
          },
          "count": 1,
          "cpu": {
            "user": 18150,
            "user_p": 0,
            "nice": 0,
            "system": 20409,
            "system_p": 970881267037344770,
            "idle": 4365624,
            "iowait": 0,
            "irq": 0,
            "softirq": 0,
            "steal": 0
          },
          "cpus": {},
          "load": {
            "load1": 0,
            "load5": 0,
            "load15": 0
          },
          "mem": {
            "total": 4294496256,
            "used": 1562251264,
            "free": 2732244992,
            "used_p": 0.36,
            "actual_used": 895258624,
            "actual_free": 1252093952,
            "actual_used_p": 0.21
          },
          "swap": {
            "total": 0,
            "used": 0,
            "free": 0,
            "used_p": 0
          },
          "type": "system"
        }
      },
      {
        "_index": "topbeat-2016.03.07",
        "_type": "system",
        "_id": "AVNRK_wm8UEWGpsSjsKX",
        "_score": 3.6337204,
        "_source": {
          "@timestamp": "2016-03-07T13:02:28.908Z",
          "beat": {
            "hostname": "_REMOVED_",
            "name": "_REMOVED_"
          },
          "count": 1,
          "cpu": {
            "user": 18150,
            "user_p": 0,
            "nice": 0,
            "system": 20410,
            "system_p": 0.05,
            "idle": 4365644,
            "iowait": 0,
            "irq": 0,
            "softirq": 0,
            "steal": 0
          },
          "cpus": {},
          "load": {
            "load1": 0,
            "load5": 0,
            "load15": 0
          },
          "mem": {
            "total": 4294496256,
            "used": 1561849856,
            "free": 2732646400,
            "used_p": 0.36,
            "actual_used": 895258624,
            "actual_free": 1252093952,
            "actual_used_p": 0.21
          },
          "swap": {
            "total": 0,
            "used": 0,
            "free": 0,
            "used_p": 0
          },
          "type": "system"
        }
      }
    ]

Thanks!

Marco


(Andrew Kroh) #4

The root cause of the high system_p value is that the cpu.system value decreases by 1.

sys_delta = cpu_2.system - cpu_1.system = 20409 - 20410 = -1

Since Topbeat was not expecting a negative delta it uses an uint64 variable type to store the sys_delta value. Since this is an unsigned variable, the value becomes 18446744073709551615 instead of -1. This is why the system_p value is really large.

If I change the variable type used to store the "sys_delta" value then the resulting system_p value becomes -0.06. It doesn't make sense, but at least it's representative of the values used in the calculation.

I pasted the relevant code and your values into the Golang playground for demonstration.

So what's causing the cpu.system value to decrease? It's either Windows or the elastic/gosigar library. I suspect it's the library. There is some floating point arithmetic being used to calculate the values. Instead of using floating point I think it should use this method.

Sorry for the long post. Would you be able to test a development build if I make a change?


(Marco Ponton) #5

I'd be happy to test a development/instrumented build to help pinpoint the issue or simply confirm the issue is not recurring anymore.


(Andrew Kroh) #6

Great. I will post an update once a development build is available. Here's the pull request with the changes; it just needs reviewed before it can merge. See the my comment in the PR.


(Andrew Kroh) #7

The dev build with my changes is now ready. Download topbeat-5.0.0-nightlylatest-windows.zip from S3 and give it a try. Thanks


(Marco Ponton) #8

I will update all servers today and will get back to you in a few days to let you know how things look. Thanks!


(Andrew Kroh) #9

@MarcoPonton What's the verdict on the change? :thumbsup: or :thumbsdown:


(Marco Ponton) #10

@andrewkroh,

Sorry for the delay! It's been running on all servers (7) for over a week now without any issue. So it looks like :thumbsup:!

I'm keeping an eye on it, will report again next week just to confirm, but if all good next week I would consider the matter closed.

Thanks for your help with this!


(system) #11