Kibana rule - raise alert when CPU is over 90% for the last 5 min

Hi gurus,

I'm new to Rules in Kibana so I need your help.

I need to raise an email alert when the CPU is constantly exceeding 90% for the past 5 minutes.

The way I configured the rule is the following:

The alert is being triggered even if the CPU only spiked above 90% for a few seconds within the past 5 minutes. Is this the normal behavior?

What I want is to trigger the alarm only if the CPU is staying over 90% for at least 5 minutes.

How can I write such a rule?

Thank you,
Catalin

Any replies guys?

Hi @catalin.bulancea

Hmm interesting...

Is that field absolute or percent?

Perhaps try Average...

What version are you on?

What exact rule are you using?

Are you using Group By?

Hi Stephen,

The field is absolute, i.e. 0.1 is 10%, 0.9 is 90%.
I am on version 7.17.4.
The rule I am using is:


So yes, I am using Group by, because there are multiple fields.subsystem values that match the fields.system.

What if I used Min instead of Max?
That would mean: if the minimum CPU usage over the last 5 mins is above 90%, then the Max and Average will be above 90% too, so the CPU stayed above 90% for the entire 5 min duration.
Is my understanding correct?

Thanks,
Catalin

Hi @catalin.bulancea

Here is the way I think of it / understand it.

Say your host.cpu.usage is collected every 10s from your hosts.

And you have FOR THE LAST 5 Minutes as your criteria, so ~30 samples are looked at for each 5 MIN interval.

For MAX: If 1 of those samples is above the Threshold and the other 29 are below, the Max condition over the 5 Min time frame IS met. (You only need 1 sample for the condition to be met.)

For MIN: If 1 of those samples is below the Threshold and the other 29 are above, the Min condition over the 5 Min time frame IS NOT met. (You only need 1 sample for the condition to not be met.)
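To make the MAX and MIN cases concrete, here is a small Python sketch. It is not Kibana code and not how the rule is implemented internally; it only mimics the comparison on one 5 minute window. The sample values and the helper name `rule_fires` are made up for illustration, and it assumes 30 samples (one every 10s) with the field stored as a 0..1 ratio, as described above.

```python
# Illustration only: mimics "aggregation over the window > threshold".
# Sample values and the rule_fires helper are hypothetical.

THRESHOLD = 0.9  # 90%, the field is a 0..1 ratio


def rule_fires(samples, aggregation, threshold=THRESHOLD):
    """Return True if the aggregated value of the window exceeds the threshold."""
    aggregate = {"max": max, "min": min}[aggregation]
    return aggregate(samples) > threshold


# 30 samples = one every 10s for 5 minutes
spike = [0.40] * 29 + [0.95]   # one short spike above 90%
dip = [0.95] * 29 + [0.40]     # busy the whole time except one low sample
sustained = [0.95] * 30        # above 90% for the whole window

for name, window in [("spike", spike), ("dip", dip), ("sustained", sustained)]:
    print(name, "MAX fires:", rule_fires(window, "max"),
          "MIN fires:", rule_fires(window, "min"))
```

With these made-up windows, MAX fires on the single spike, while MIN stays quiet as soon as one sample dips below the threshold, even though the host was busy for the rest of the window.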

So yes, your assumption is correct, but it is not the recommended approach, because all it takes is 1 sample below the Threshold for the criteria not to be met.

This is why the vast majority of users use AVG for the case you are describing.

If you are concerned, perhaps change the window to the last 1 MIN.

Hi Stephen,

Thank you for your explanations! It makes more sense now. So MIN is not the way to go. I will try the AVG and see how it goes.
Could you give me an example of how AVG will behave with the 30 samples in the 5 min interval?

Thank you,
Catalin

For AVERAGE: It will calculate the average (arithmetic mean) of the samples in the window. So it sums all the CPU values collected over the 5 minutes and divides by the number of values (~30 samples in the 5 mins). It is a simple, direct average calculation: Sum of the Values / Count of Values.
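As a rough illustration of that calculation, here is a small Python sketch. Again, this is not Kibana code; the sample values and the `average_fires` helper are invented for the example, assuming 30 samples per 5 minute window and a 0..1 ratio field with a 0.9 threshold.

```python
# Illustration only: arithmetic mean of one 5 minute window vs. the threshold.
# Sample values and the average_fires helper are hypothetical.

THRESHOLD = 0.9  # 90%, the field is a 0..1 ratio


def average(samples):
    """Sum of the values / count of values."""
    return sum(samples) / len(samples)


def average_fires(samples, threshold=THRESHOLD):
    """Return True if the window average exceeds the threshold."""
    return average(samples) > threshold


spike = [0.40] * 29 + [0.95]        # one short spike, average stays low
sustained = [0.95] * 30             # above 90% for the whole window
mixed = [1.00] * 15 + [0.85] * 15   # half the samples are below 90%

for name, window in [("spike", spike), ("sustained", sustained), ("mixed", mixed)]:
    print(name, "avg =", round(average(window), 3),
          "AVG fires:", average_fires(window))
```

In this sketch a single spike barely moves the average, so the rule does not fire, while a sustained load does. Note that the `mixed` window also fires: an average above 90% does not guarantee that every sample was above 90%, only that the window as a whole was.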
