Topbeat / top command line discrepancies

vchav73 · June 8, 2016, 6:13pm

I am using Topbeat with ELK to monitor CentOS 6.8 server performance. I am trying to recreate the % idle value reported from the top command:

Cpu(s): 2.7%us, 1.0%sy, 2.8%ni, 90.5%id, 2.5%wa, 0.0%hi, 0.2%si, 0.3%st

In logstash I have this filter:

ruby {
  # Both statements share one code string rather than two separate code settings.
  code => "
    event['cpu']['total'] = event['cpu']['user'] + event['cpu']['nice'] + event['cpu']['system'] + event['cpu']['idle'] + event['cpu']['iowait'] + event['cpu']['irq'] + event['cpu']['softirq'] + event['cpu']['steal']
    event['cpu']['idle_p'] = 100.0 * event['cpu']['idle'] / event['cpu']['total']
  "
}

When I compare the top command line output to my calculated idle_p value over time, I see that the calculated value is far more stable than what I see on the command line. On the command line I see swings of +/-10% over a few seconds, while the calculated value swings +/-0.1% over the course of an hour. It seems that cpu.idle hardly changes.

I am wondering why the swings I see from the command line are not seen in the cpu.idle value. Is there some kind of averaging taking place when Topbeat samples? I confirmed using Topbeat logs that the cpu.idle values are what's being reported by Topbeat and not due to some post processing by Logstash.
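One thing worth checking, sketched here under an assumption: if the cpu.* fields are cumulative counters since boot (as the raw values in /proc/stat are on Linux), then idle / total is an average over the whole uptime and will barely move, whereas top reports each state's share of the time elapsed since its previous refresh. The counter values below are hypothetical and chosen only to illustrate the difference; the function names are not part of Topbeat or Logstash.

```python
# Fields summed in the Logstash filter above.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def idle_pct_cumulative(sample):
    """Idle share of all CPU time in one sample (what the filter computes)."""
    total = sum(sample[f] for f in FIELDS)
    return 100.0 * sample["idle"] / total


def idle_pct_interval(prev, curr):
    """Idle share of the CPU time that elapsed between two samples."""
    deltas = {f: curr[f] - prev[f] for f in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return None  # no time elapsed, or counters were reset
    return 100.0 * deltas["idle"] / total


# Hypothetical cumulative counters (ticks), taken 10 seconds apart.
prev = {"user": 500000, "nice": 300000, "system": 200000, "idle": 8000000,
        "iowait": 900000, "irq": 0, "softirq": 10000, "steal": 90000}
curr = {"user": 500015, "nice": 300026, "system": 200016, "idle": 8000474,
        "iowait": 900456, "irq": 0, "softirq": 10001, "steal": 90012}

print("cumulative idle %:", idle_pct_cumulative(prev), idle_pct_cumulative(curr))
print("interval idle %:  ", idle_pct_interval(prev, curr))
```

If the assumption holds, the interval form is the one that should track top, and the cumulative form would show exactly the near-constant behavior described above.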

steffens · June 9, 2016, 1:21pm

What's the reporting interval in Topbeat vs. top? The Topbeat default is 10 seconds. That is, CPU usage (the total time a process had the CPU allocated) is collected every 10 seconds. Sampling less often (so-called aliasing) has the effect of averaging out the usage, so short spikes are smoothed away.

vchav73 · June 10, 2016, 4:14am

I ran "top -d 10" to match Topbeat's reporting interval of 10 seconds, but I see the same thing. The Topbeat value is far less variable.

Any suggestions where to go from here? The differences we see make us uncomfortable with relying on what Topbeat is reporting. Interestingly, we also see differences between df and the file system metrics reported by Topbeat.

monica · June 10, 2016, 10:34am

Is there a big difference between the values that you get from the top command and the ones from Topbeat? The values cannot be identical, but they should differ only slightly. This is because the top command calculates the values differently on some platforms, and the two sampling windows never line up exactly, even when both intervals are set to 10 seconds.

vchav73 · June 10, 2016, 7:34pm

The command line and Topbeat outputs differ substantially. I captured both the top command line output and the Topbeat output below. The samples are from the same one-minute period, with Topbeat and command line top both sampling every 10 seconds.

top -b -d 10
Cpu(s): 2.7%us, 1.6%sy, 1.7%ni, 79.5%id, 13.5%wa, 0.0%hi, 0.0%si, 1.0%st
Cpu(s): 1.5%us, 1.6%sy, 2.6%ni, 47.4%id, 45.6%wa, 0.0%hi, 0.1%si, 1.3%st
Cpu(s): 5.8%us, 2.2%sy, 3.5%ni, 50.4%id, 36.4%wa, 0.0%hi, 0.0%si, 1.7%st
Cpu(s): 5.8%us, 1.7%sy, 3.1%ni, 46.9%id, 40.7%wa, 0.0%hi, 0.1%si, 1.7%st
Cpu(s): 3.2%us, 2.7%sy, 2.6%ni, 31.0%id, 58.5%wa, 0.0%hi, 0.1%si, 2.0%st
Cpu(s): 6.5%us, 2.4%sy, 2.8%ni, 18.4%id, 68.3%wa, 0.0%hi, 0.1%si, 1.6%st
Cpu(s): 2.0%us, 1.8%sy, 6.7%ni, 55.5%id, 32.2%wa, 0.0%hi, 0.0%si, 1.8%st

Topbeat:
June 10th 2016, 12:12:25.420 79.512
June 10th 2016, 12:12:15.420 79.512
June 10th 2016, 12:12:05.420 79.513
June 10th 2016, 12:11:55.420 79.513
June 10th 2016, 12:11:45.420 79.513
June 10th 2016, 12:11:35.433 79.514
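To put a number on the gap, here is a minimal sketch that pulls the %id figure out of the top lines above and compares its spread with the spread of the calculated idle_p values. The regular expression assumes the CentOS 6 `top` summary format shown in this post; the helper names are illustrative only.

```python
import re

# The seven Cpu(s) summary lines captured from `top -b -d 10` above.
TOP_LINES = """\
Cpu(s): 2.7%us, 1.6%sy, 1.7%ni, 79.5%id, 13.5%wa, 0.0%hi, 0.0%si, 1.0%st
Cpu(s): 1.5%us, 1.6%sy, 2.6%ni, 47.4%id, 45.6%wa, 0.0%hi, 0.1%si, 1.3%st
Cpu(s): 5.8%us, 2.2%sy, 3.5%ni, 50.4%id, 36.4%wa, 0.0%hi, 0.0%si, 1.7%st
Cpu(s): 5.8%us, 1.7%sy, 3.1%ni, 46.9%id, 40.7%wa, 0.0%hi, 0.1%si, 1.7%st
Cpu(s): 3.2%us, 2.7%sy, 2.6%ni, 31.0%id, 58.5%wa, 0.0%hi, 0.1%si, 2.0%st
Cpu(s): 6.5%us, 2.4%sy, 2.8%ni, 18.4%id, 68.3%wa, 0.0%hi, 0.1%si, 1.6%st
Cpu(s): 2.0%us, 1.8%sy, 6.7%ni, 55.5%id, 32.2%wa, 0.0%hi, 0.0%si, 1.8%st
"""

# Matches the number directly in front of "%id" on each summary line.
IDLE_RE = re.compile(r"(\d+(?:\.\d+)?)%id")


def idle_values(text):
    """Return every idle percentage found in top's Cpu(s) lines."""
    return [float(v) for v in IDLE_RE.findall(text)]


def spread(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


top_idle = idle_values(TOP_LINES)
# Calculated idle_p values from the Topbeat samples listed above.
topbeat_idle_p = [79.512, 79.512, 79.513, 79.513, 79.513, 79.514]

print("top idle spread:     %.3f" % spread(top_idle))
print("Topbeat idle spread: %.3f" % spread(topbeat_idle_p))
```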

monica · June 13, 2016, 12:16pm

Thank you for the detailed information. Indeed, there is a bug in the reported idle value. Please open an issue on GitHub: https://github.com/elastic/beats/issues, so you can easily track the progress of the bug.

vchav73 · June 13, 2016, 3:12pm

I submitted the issue: https://github.com/elastic/beats/issues/1841
