Metricbeat CPU information is wrong

Hello,

I'm experimenting with ELK/Beats to learn what is out there. I deployed Metricbeat and can confirm that data is reaching ELK (specifically, I send it to Logstash). In the Kibana UI I can see the data, but surprisingly it is not what I expect.

I ran load on the box to produce:

0% CPU usage
50% CPU usage
100% CPU usage

In other words: 0 full CPUs used, 1 full CPU used, or 2 full CPUs used. I applied this load on purpose to check whether Kibana would show the numbers properly.

I plotted a Kibana visualization of system.processes.cpu.total.pct and also of system.cpu.total.pct. In the graphs I see values of 0, 1 and 2, which is unexpected because I was expecting 0%, 50% and 100%.

So these metrics do not report what total percentage of the CPU is busy (0 to 100%); instead they report how many CPUs are in use (0, 1, 2). This is wrong and unexpected. I checked many other CPU-related KPIs and saw the same thing.

My OS is CentOS Linux release 7.4.1708, kernel Linux apache 3.10.0-693.el7.x86_64.

The ELK version is 6.2.4 (the latest).
I have 2 CPUs:

[root@apache metricbeat]# cat /proc/cpuinfo |grep -i proc
processor : 0
processor : 1
[root@apache metricbeat]#

Among other tests, when I kept 1 CPU busy, vmstat showed 50% CPU idle (50% busy):

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 1007640 2116 551528 0 0 0 0 1080 93 50 0 50 0 0
1 0 0 1007640 2116 551528 0 0 0 0 1079 88 49 1 50 0 0
1 0 0 1007640 2116 551532 0 0 0 0 1091 113 50 0 50 0 0
1 0 0 1007640 2116 551532 0 0 0 0 1078 89 50 0 50 0 0
1 0 0 1007640 2116 551532 0 0 0 3 1079 93 50 0 50 0 0
1 0 0 1007640 2116 551532 0 0 0 0 1071 84 49 0 50 0 0
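For comparison with the Metricbeat numbers: vmstat's us, sy, id, wa and st columns are shares of total CPU time across all CPUs, so they sum to roughly 100 regardless of the core count. A minimal Python sketch (the helper vmstat_busy_pct is hypothetical, and it assumes a row with exactly the 17 columns of the header above) that derives busy CPU as 100 minus idle:

```python
# Column names taken from the vmstat header shown above.
VMSTAT_COLUMNS = [
    "r", "b", "swpd", "free", "buff", "cache", "si", "so",
    "bi", "bo", "in", "cs", "us", "sy", "id", "wa", "st",
]


def vmstat_busy_pct(row: str) -> int:
    """Return busy CPU % (100 - idle) for one vmstat data row.

    Hypothetical helper: assumes the row has the same 17 columns
    as the header above.
    """
    values = [int(field) for field in row.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError(
            f"expected {len(VMSTAT_COLUMNS)} columns, got {len(values)}"
        )
    fields = dict(zip(VMSTAT_COLUMNS, values))
    return 100 - fields["id"]


row = "2 0 0 1007640 2116 551528 0 0 0 0 1080 93 50 0 50 0 0"
print(vmstat_busy_pct(row))  # 50
```

Every row above gives 50, i.e. half of the 2-CPU machine is busy, which is the figure expected from Metricbeat.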

However, the Metricbeat console output showed:

2018-05-10T20:43:49.764-0700 DEBUG [logstash] logstash/async.go:142 17 events out of 17 events sent to logstash host 192.168.1.109:5443. Continue sending
2018-05-10T20:43:58.763-0700 DEBUG [publish] pipeline/processor.go:275 Publish event: {
"@timestamp": "2018-05-11T03:43:58.762Z",
"@metadata": {
"beat": "metricbeat",
"type": "doc",
"version": "6.2.4"
},
"metricset": {
"name": "cpu",
"module": "system",
"rtt": 156
},
"system": {
"cpu": {
"cores": 2,
"nice": {
"pct": 0
},
"softirq": {
"pct": 0.001
},
"user": {
"pct": 0.9945
},
"idle": {
"pct": 0.9965
},
"irq": {
"pct": 0
},
"iowait": {
"pct": 0
},
"steal": {
"pct": 0
},
"system": {
"pct": 0.008
},
"total": {
"pct": 1.0035
}
}
},
"beat": {
"version": "6.2.4",
"name": "apache",
"hostname": "apache"
}
}

As you can see, the raw data itself shows a total pct of 1.0035, which is wrong.

So it looks like a factor of 50 is missing somewhere (1.0035% * 50 ≈ 50%, the expected usage).

I noticed the thread "Need to understand metricbeat cpu metrics"; it looks like somebody else has seen the same behavior.

Can you please point me to what to do here?

Many thanks
Luis

The percentages are between 0..1, so 1.0035 equals 100.35%.

CPU percentages reported by Metricbeat are not normalized by default. This means that Metricbeat sums the percentages across all CPUs. You have 2 CPUs, so the percentages need to be divided by 2. If you do the math, you can see that 100.35% / 2 ≈ 50%.
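The arithmetic can be sketched in Python. This is an illustration only, assuming (as described above) that the non-normalized value is a sum of per-CPU fractions, so it ranges from 0 to the number of cores; normalize_pct is a hypothetical helper, not part of Metricbeat:

```python
def normalize_pct(total_pct: float, cores: int) -> float:
    """Convert a summed (non-normalized) CPU fraction to a 0..1 machine-wide fraction.

    Hypothetical helper: assumes total_pct is the sum of per-CPU
    fractions, i.e. it ranges from 0 to `cores`.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return total_pct / cores


# Values from the event above: total.pct = 1.0035 with cores = 2.
normalized = normalize_pct(1.0035, 2)
print(f"{normalized:.1%}")  # 50.2%
```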

If you add normalized_percentages to your config, Metricbeat does the normalization for you:

cpu.metrics:  ["percentages", "normalized_percentages"]

Hi @kvch,

I made the change you suggested, but I still see the system reporting on a scale of 1 instead of 100%:


2018-05-11T09:58:55.199-0700 DEBUG [publish] pipeline/processor.go:275 Publish event: {
"@timestamp": "2018-05-11T16:58:55.198Z",
"@metadata": {
"beat": "metricbeat",
"type": "doc",
"version": "6.2.4"
},
"metricset": {
"name": "cpu",
"module": "system",
"rtt": 172
},
"system": {
"cpu": {
"cores": 2,
"system": {
"pct": 0.007
},
"idle": {
"pct": 0.9965
},
"nice": {
"pct": 0
},
"softirq": {
"pct": 0.001
},
"iowait": {
"pct": 0
},
"steal": {
"pct": 0
},
"total": {
"pct": 1.0035
},
"irq": {
"pct": 0
},
"user": {
"pct": 0.9955
}
}
},
"beat": {
"name": "apache",
"hostname": "apache",
"version": "6.2.4"
}
}
2018-05-11T09:58:55.201-0700 DEBUG [publish] pipeline/processor.go:275 Publish event: {
"@timestamp": "2018-05-11T16:58:55.200Z",
"@metadata": {
"beat": "metricbeat",
"type": "doc",
"version": "6.2.4"
},
"metricset": {
"rtt": 125,
"name": "cpu",
"module": "system"
},
"system": {
"cpu": {
"irq": {
"pct": 0
},
"total": {
"pct": 1.004
},
"idle": {
"pct": 0.996
},
"nice": {
"pct": 0
},
"softirq": {
"pct": 0.001
},
"cores": 2,
"system": {
"pct": 0.007
},
"steal": {
"pct": 0
},
"user": {
"pct": 0.996
},
"iowait": {
"pct": 0
}
}
},
"beat": {
"name": "apache",
"hostname": "apache",
"version": "6.2.4"
}
}
2018-05-11T09:58:55.201-0700 DEBUG [publish] pipeline/processor.go:275 Publish event: {
"@timestamp": "2018-05-11T16:58:55.201Z",
"@metadata": {
"beat": "metricbeat",
"type": "doc",
"version": "6.2.4"
},
"metricset": {
"module": "system",
"rtt": 132,
"name": "cpu"
},
"system": {
"cpu": {
"idle": {
"pct": 0.996,
"norm": {
"pct": 0.498
}
},
"irq": {
"pct": 0,
"norm": {
"pct": 0
}
},
"nice": {
"pct": 0,
"norm": {
"pct": 0
}
},
"system": {
"pct": 0.007,
"norm": {
"pct": 0.0035
}
},
"iowait": {
"norm": {
"pct": 0
},
"pct": 0
},
"total": {
"pct": 1.004,
"norm": {
"pct": 0.502
}
},
"cores": 2,
"softirq": {
"pct": 0.001,
"norm": {
"pct": 0.0005
}
},
"steal": {
"pct": 0,
"norm": {
"pct": 0
}
},
"user": {
"pct": 0.996,
"norm": {
"pct": 0.498
}
}
}
},
"beat": {
"name": "apache",
"hostname": "apache",
"version": "6.2.4"
}
}
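Only the last of the three events above carries norm sub-fields, and its system.cpu.total.norm.pct is 0.502, about 50%, which agrees with vmstat; that is likely the field to plot in Kibana. A minimal Python sketch for pulling the machine-wide value out of such an event, preferring norm.pct when present and otherwise dividing pct by cores (the helper total_cpu_fraction is hypothetical, and the events are abbreviated copies of the output above):

```python
import json

# Abbreviated copies of two events from the output above.
EVENT_WITHOUT_NORM = json.loads(
    '{"system": {"cpu": {"cores": 2, "total": {"pct": 1.004}}}}'
)
EVENT_WITH_NORM = json.loads(
    '{"system": {"cpu": {"cores": 2,'
    ' "total": {"pct": 1.004, "norm": {"pct": 0.502}}}}}'
)


def total_cpu_fraction(event: dict) -> float:
    """Machine-wide CPU usage as a 0..1 fraction.

    Prefers system.cpu.total.norm.pct; falls back to total.pct / cores
    when the event was produced without normalized_percentages.
    """
    cpu = event["system"]["cpu"]
    norm = cpu["total"].get("norm")
    if norm is not None:
        return norm["pct"]
    return cpu["total"]["pct"] / cpu["cores"]


for event in (EVENT_WITHOUT_NORM, EVENT_WITH_NORM):
    print(round(total_cpu_fraction(event), 3))  # 0.502 both times
```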


My config is like this:


- module: system
  period: 10s
  metricsets:
    - cpu
    #- load
    #- memory
    #- network
    #- process
    #- process_summary
    #- core
    #- diskio
    #- socket
  processes: ['.*']
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory

- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
  - drop_event.when.regexp:
      system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'

- module: system
  period: 15m
  metricsets:
    - uptime

- module: system
  metricsets: cpu

- module: system
  metricsets: [cpu]
  cpu.metrics: ["percentages", "normalized_percentages"]
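One plausible reading of this config (an assumption, not confirmed in this thread): three module: system blocks enable the cpu metricset, so Metricbeat would emit three cpu events per collection, and only the block that sets cpu.metrics with normalized_percentages would add norm fields. That matches the three events above, where only the last one has norm. The Python sketch below models the blocks as plain dicts (a hand-copied, simplified stand-in for the YAML, not a real parser; cpu_blocks is a hypothetical helper) and lists which blocks collect CPU data, assuming cpu.metrics defaults to ["percentages"] when not set:

```python
# Simplified, hand-copied stand-in for the module blocks in the config above.
MODULE_BLOCKS = [
    {"module": "system", "period": "10s", "metricsets": ["cpu"]},
    {"module": "system", "period": "1m", "metricsets": ["filesystem", "fsstat"]},
    {"module": "system", "period": "15m", "metricsets": ["uptime"]},
    {"module": "system", "metricsets": "cpu"},
    {
        "module": "system",
        "metricsets": ["cpu"],
        "cpu.metrics": ["percentages", "normalized_percentages"],
    },
]

# Assumed default when a block does not set cpu.metrics.
DEFAULT_CPU_METRICS = ["percentages"]


def cpu_blocks(blocks):
    """Yield (index, cpu.metrics) for every system block that enables cpu."""
    for index, block in enumerate(blocks):
        metricsets = block.get("metricsets", [])
        if isinstance(metricsets, str):  # e.g. "metricsets: cpu"
            metricsets = [metricsets]
        if block.get("module") == "system" and "cpu" in metricsets:
            yield index, block.get("cpu.metrics", DEFAULT_CPU_METRICS)


for index, metrics in cpu_blocks(MODULE_BLOCKS):
    has_norm = "normalized_percentages" in metrics
    print(f"block {index}: {metrics} -> norm fields: {has_norm}")
```

Under that assumption, keeping a single cpu block (for example, adding the cpu.metrics line to the first block and removing the last two) would be expected to produce one event per period carrying both pct and norm.pct.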


Any further feedback?

Is it a version issue or a config issue?

Thanks
Luis
