Metricbeat CPU information is wrong


(Metricbeat) #1

Hello,

I'm playing with ELK/Beats to learn and see what is out there. I deployed Metricbeat and I can confirm that data is making its way to ELK (in particular, I send it through Logstash). In the Kibana UI I can see the information, but surprisingly it is not what I expected.

I was running load on the box to get:

0% CPU usage
50% CPU Usage
100% CPU Usage

Or in other words, 0 full CPUs used, 1 full CPU used, or 2 full CPUs used. I applied this load on purpose to confirm whether Kibana would show the numbers properly.

I plotted a Kibana visualization of system.processes.cpu.total.pct and also of system.cpu.total.pct. Somehow I see values of 0, 1 and 2 in the graphs, which is unexpected because I was expecting values of 0%, 50% and 100% instead.

So these metrics are not saying what total percentage of the CPU is busy (0 to 100%); instead they say how many CPUs are in use (0, 1 or 2). This seems wrong and unexpected. I checked many other CPU-related KPIs and they behave the same way.

My OS is CentOS Linux release 7.4.1708, kernel Linux apache 3.10.0-693.el7.x86_64.

The version of ELK is 6.2.4 (the latest).
I have 2 CPUs:

[root@apache metricbeat]# cat /proc/cpuinfo |grep -i proc
processor : 0
processor : 1
[root@apache metricbeat]#

And among other tests, when I kept 1 CPU busy, vmstat was showing 50% CPU idle (i.e. 50% busy):

procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 1007640 2116 551528 0 0 0 0 1080 93 50 0 50 0 0
1 0 0 1007640 2116 551528 0 0 0 0 1079 88 49 1 50 0 0
1 0 0 1007640 2116 551532 0 0 0 0 1091 113 50 0 50 0 0
1 0 0 1007640 2116 551532 0 0 0 0 1078 89 50 0 50 0 0
1 0 0 1007640 2116 551532 0 0 0 3 1079 93 50 0 50 0 0
1 0 0 1007640 2116 551532 0 0 0 0 1071 84 49 0 50 0 0

However, the Metricbeat console output was showing:

2018-05-10T20:43:49.764-0700 DEBUG [logstash] logstash/async.go:142 17 events out of 17 events sent to logstash host 192.168.1.109:5443. Continue sending
2018-05-10T20:43:58.763-0700 DEBUG [publish] pipeline/processor.go:275 Publish event: {
"@timestamp": "2018-05-11T03:43:58.762Z",
"@metadata": {
"beat": "metricbeat",
"type": "doc",
"version": "6.2.4"
},
"metricset": {
"name": "cpu",
"module": "system",
"rtt": 156
},
"system": {
"cpu": {
"cores": 2,
"nice": {
"pct": 0
},
"softirq": {
"pct": 0.001
},
"user": {
"pct": 0.9945
},
"idle": {
"pct": 0.9965
},
"irq": {
"pct": 0
},
"iowait": {
"pct": 0
},
"steal": {
"pct": 0
},
"system": {
"pct": 0.008
},
"total": {
"pct": 1.0035
}
}
},
"beat": {
"version": "6.2.4",
"name": "apache",
"hostname": "apache"
}
}

As you can see, the raw data itself shows a total pct of 1.0035, which looks wrong.

So it looks like a factor of 50 is missing somewhere: 1.0035% * 50 = the expected 50% usage.

I noticed the topic "Need to understand metricbeat cpu metrics"; it looks like somebody else has seen the same behavior.

Can you please point me to what I should do here?

Many thanks
Luis


(Noémi Ványi) #2

The percentages are between 0..1, so 1.0035 equals 100.35%.

CPU percentages reported by Metricbeat are not normalized by default. This means that Metricbeat sums the percentages across CPUs. You have 2 CPUs, so the percentages need to be divided by 2. If you do the math, you can see that 100.35% / 2 ≈ 50%.

If you add normalized_percentages to your config, Metricbeat does the normalization for you:

cpu.metrics:  ["percentages", "normalized_percentages"]
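To illustrate the math, here is a minimal sketch in Python. The numbers are taken from the event in the first post; `normalize` is just a helper name for illustration, not a Metricbeat API:

```python
# Metricbeat's non-normalized CPU percentages are fractions summed
# across cores: a 2-core box with one core fully busy reports ~1.0.
# The normalized ("norm") values divide by the core count, so they
# stay within 0..1 for the whole machine.

def normalize(pct, cores):
    """Convert a summed per-core fraction into a 0..1 machine-wide fraction."""
    return pct / cores

cores = 2              # from system.cpu.cores in the event
total_pct = 1.0035     # system.cpu.total.pct from the event

machine_fraction = normalize(total_pct, cores)
print(f"{machine_fraction * 100:.1f}% of the machine is busy")
```

This matches the vmstat observation in the first post: one busy core out of two is about 50% machine-wide.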

(Metricbeat) #3

Hi @kvch,

I made the change you suggested, but I still see the system reporting on a scale of 1 instead of 100%:


2018-05-11T09:58:55.199-0700 DEBUG [publish] pipeline/processor.go:275 Publish event: {
"@timestamp": "2018-05-11T16:58:55.198Z",
"@metadata": {
"beat": "metricbeat",
"type": "doc",
"version": "6.2.4"
},
"metricset": {
"name": "cpu",
"module": "system",
"rtt": 172
},
"system": {
"cpu": {
"cores": 2,
"system": {
"pct": 0.007
},
"idle": {
"pct": 0.9965
},
"nice": {
"pct": 0
},
"softirq": {
"pct": 0.001
},
"iowait": {
"pct": 0
},
"steal": {
"pct": 0
},
"total": {
"pct": 1.0035
},
"irq": {
"pct": 0
},
"user": {
"pct": 0.9955
}
}
},
"beat": {
"name": "apache",
"hostname": "apache",
"version": "6.2.4"
}
}
2018-05-11T09:58:55.201-0700 DEBUG [publish] pipeline/processor.go:275 Publish event: {
"@timestamp": "2018-05-11T16:58:55.200Z",
"@metadata": {
"beat": "metricbeat",
"type": "doc",
"version": "6.2.4"
},
"metricset": {
"rtt": 125,
"name": "cpu",
"module": "system"
},
"system": {
"cpu": {
"irq": {
"pct": 0
},
"total": {
"pct": 1.004
},
"idle": {
"pct": 0.996
},
"nice": {
"pct": 0
},
"softirq": {
"pct": 0.001
},
"cores": 2,
"system": {
"pct": 0.007
},
"steal": {
"pct": 0
},
"user": {
"pct": 0.996
},
"iowait": {
"pct": 0
}
}
},
"beat": {
"name": "apache",
"hostname": "apache",
"version": "6.2.4"
}
}
2018-05-11T09:58:55.201-0700 DEBUG [publish] pipeline/processor.go:275 Publish event: {
"@timestamp": "2018-05-11T16:58:55.201Z",
"@metadata": {
"beat": "metricbeat",
"type": "doc",
"version": "6.2.4"
},
"metricset": {
"module": "system",
"rtt": 132,
"name": "cpu"
},
"system": {
"cpu": {
"idle": {
"pct": 0.996,
"norm": {
"pct": 0.498
}
},
"irq": {
"pct": 0,
"norm": {
"pct": 0
}
},
"nice": {
"pct": 0,
"norm": {
"pct": 0
}
},
"system": {
"pct": 0.007,
"norm": {
"pct": 0.0035
}
},
"iowait": {
"norm": {
"pct": 0
},
"pct": 0
},
"total": {
"pct": 1.004,
"norm": {
"pct": 0.502
}
},
"cores": 2,
"softirq": {
"pct": 0.001,
"norm": {
"pct": 0.0005
}
},
"steal": {
"pct": 0,
"norm": {
"pct": 0
}
},
"user": {
"pct": 0.996,
"norm": {
"pct": 0.498
}
}
}
},
"beat": {
"name": "apache",
"hostname": "apache",
"version": "6.2.4"
}
}


My config is like this:


- module: system
  period: 10s
  metricsets:
    - cpu
    #- load
    #- memory
    #- network
    #- process
    #- process_summary
    #- core
    #- diskio
    #- socket
  processes: ['.*']
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory
- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
    - drop_event.when.regexp:
        system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'
- module: system
  period: 15m
  metricsets:
    - uptime
- module: system
  metricsets: cpu
- module: system
  metricsets: [cpu]
  cpu.metrics: ["percentages", "normalized_percentages"]
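Note that this config ends up declaring the cpu metricset in three separate system module blocks, and only the last one carries the cpu.metrics setting. If the intent was a single cpu block with both raw and normalized percentages (an assumption on my part, not what the file currently says), a consolidated sketch might look like:

```yaml
# Consolidated sketch (an assumed intent, not the original file):
# a single system/cpu block carrying both percentage variants.
- module: system
  period: 10s
  metricsets:
    - cpu
  cpu.metrics: ["percentages", "normalized_percentages"]
```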



(Metricbeat) #4

Any further feedback?

Is it a version issue or a config issue?

Thanks
Luis


(system) #5

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.