Metricbeat Kibana dashboards - "Overview" - intermittent results

I'd love some assistance with the issues below. Is there anything I can do to diagnose them?

Issue 1

I'm looking at a fresh "[Metricbeat Kubernetes] - Overview" dashboard.

The top-left stats (Nodes, Deployments, Desired Pods, Available Pods, Unavailable Pods) show only intermittently. I left it on a 5 sec auto-refresh and here are some results, noting whether correct values were shown (y) or zeroes (n):

nnnynynynnnynynyn

This is when set to "Last 15 minutes".

"Last 30 minutes" seems to show values for Nodes and Deployments, but the pod counters are usually (75% of the time?) showing zero.

The same happens on other metricbeat dashboards for other stats.

Issue 2

Same dashboard. Intermittently, I get this:

Issue 3

"[Metricbeat Docker] Overview" dashboard now:

image

If I hide the legend, I see the data, without knowing what it relates to. If I show the legend, the data is distorted and out of bounds.

Issue 4

The top-right area, same dashboard:

image

Issue 5
"[Metricbeat System] Containers overview" dashboard, minor issue, but the links are cut off and there are scrollbars:

image

Issue 6
Probably just me, but... billions of nanocores? Could I ask why this is?

I'm not sure I can follow your Issue 1. Could you also share a screenshot here?

In general, it seems that some of the issues you are reporting come from how Kibana works. For example, scroll bars can appear depending on screen size or browser. I agree that's not nice, and we are trying to fix such things.

Could you share your Metricbeat and Kibana versions and, optionally, your Metricbeat config file?

Note: I moved your topic to the Metricbeat category.

Hi @ruflin,

Thanks for the response :slight_smile: Indeed I figured that some of those issues are CSS/layout/etc, but they are pretty annoying to work around, although the data is sound.

Here's a screen recording:

https://kierenj.tinytake.com/sf/Mjg2MzU5MV84NTk0ODk2

Does that make sense?

Oops, sorry - it's all v6.3.2. Kube Manifest (includes config): https://gist.github.com/kierenj/61293824a7a35515bdb6fda3b2a69f91

Hi @Kieren_Johnstone,

Thanks for the screen recording. I think the zero/non-zero issue (issue 1) that you are seeing is due to the refresh interval in Kibana being every 5s while Metricbeat is reporting data every 10s (as configured via this setting: https://gist.github.com/kierenj/61293824a7a35515bdb6fda3b2a69f91#file-metricbeat-manifest-yml-L191).

As a test, can you try changing the refresh interval in Kibana to 10s and see if the zero/non-zero issue goes away?

Thanks,

Shaunak

Hi, humm OK that would be surprising to me, I thought "last 15 minutes" means it gets data from now-15 minutes, regardless of refresh frequency? Either way, setting to a 10sec refresh interval doesn't fix it: in fact it seems to show 0 more often. No luck!

The metric visualizations for number of containers (etc.) show the last bucket of data in a date histogram, not the total for the last 15 minutes. Unfortunately there is no way to guarantee that the last bucket contains valid data; within the team we refer to this as the partial bucket problem. Kibana is making a request for data that's not complete yet. There is a pending PR open that will fix this problem. With that PR in place we need to change the dashboard to do the calculation on the last minute of data instead of the last 30 seconds determined by the auto bucketing. @shaunak was correct about the issue but the fix is more complex.
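To illustrate the partial bucket problem described above, here is a minimal Python sketch. It is not Kibana's code: the function names, the 30-second bucket width, the 10-second sample spacing, and the timestamps are all assumptions chosen for illustration.

```python
def bucket_counts(sample_times, now, window, interval):
    """Count samples per fixed-width bucket over the range [now - window, now].

    All arguments are integer seconds. Buckets are aligned to multiples of
    `interval`, so the first and last bucket are usually only partly covered
    by the requested time range.
    """
    first_bucket = (now - window) // interval * interval
    last_bucket = now // interval * interval
    counts = {b: 0 for b in range(first_bucket, last_bucket + 1, interval)}
    for t in sample_times:
        if now - window <= t <= now:
            counts[t // interval * interval] += 1
    return counts


def last_bucket_value(counts, drop_last=False):
    """Return the newest bucket's count, or the one before it if drop_last."""
    ordered = [counts[key] for key in sorted(counts)]
    return ordered[-2] if drop_last else ordered[-1]


# Hypothetical data: one sample every 10 s, arriving at 5, 15, 25, ... seconds.
samples = range(5, 2000, 10)

# Query 2 s into a new 30 s bucket: no sample has landed in it yet.
counts = bucket_counts(samples, now=902, window=900, interval=30)
print(last_bucket_value(counts))                  # newest (partial) bucket
print(last_bucket_value(counts, drop_last=True))  # newest complete bucket
```

In this toy model, a query at t=902 reads 0 from the newest bucket because the bucket started at t=900 and the next sample is not due until t=905, while the previous bucket holds 3 samples. The same range also clips the oldest bucket, which is the "partial bucket at the beginning" case.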

Ah I see, that makes sense, thanks.

I don't suppose you'd know what's going on for "Issue 2" (seemingly, all data is grouped into the first timestamp histogram bucket for that one)?

Also: I see on the PR it's maybe down for 6.6 or similar. Am I just incredibly unlucky to have this hitting 50% of the time, is there some config change I can make to improve my situation, or are there lots of users with unusable Kube metricbeat dashboards for the next... few months/year? I'd guess I'm just experiencing it pretty intensely, but I'm not sure why?

I looked into the code for this visualization and it looks like there is a derivative involved. When this happens, there can sometimes be a "spike" in the visualization. I think this is what is going on here. @simianhacker can you confirm?

@shaunak @Kieren_Johnstone Yes... that PR I mentioned also trims off that part. It's a result of a partial bucket on the beginning part of the data too.
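As a rough illustration of how a partial first bucket could turn into a spike once a derivative is applied, here is a small Python sketch. The numbers and function names are hypothetical, and the mechanism shown (the first bucket holding only part of the data, so its aggregated value is too low) is an assumption rather than a description of the actual visualization code.

```python
def derivative(values):
    """Difference between each bucket value and the bucket before it."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def trim_first_bucket(values):
    """Drop the oldest bucket, which may only partly overlap the time range."""
    return values[1:]


# Hypothetical per-bucket totals of a steadily growing counter.
# The first bucket is partial, so its total is far lower than it should be.
bucket_totals = [1000, 3300, 4200, 5100]

print(derivative(bucket_totals))                     # first step is inflated
print(derivative(trim_first_bucket(bucket_totals)))  # steady growth only
```

Here the steady growth is 900 per bucket, but the step out of the partial first bucket is 2300, which would show up as a spike at the left edge of the chart. Trimming that bucket before taking the derivative removes it.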

Ah, fantastic, thanks. Well, I mean, fantastic to know what it is! I am curious though, why I see this, and presumably the vast majority of people do not? Are there settings I can tweak to improve my situation without the PR?

@simianhacker Sorry to @ you, but I'm really keen on getting visibility on our cluster.. surely this doesn't affect the majority of people, so might I ask if there's anything I can do to mitigate this problem - say, with my config?

Thank you!

Can anyone advise? Surely these dashboards must be successfully used by hundreds of people who don't run into this bug?

@ruflin @shaunak @simianhacker I'm sorry to @ you again - but is there anything I can do at all? Surely many others are using this successfully - is there nothing to be tweaked in the config? Is it really a random bug? Please help!

The best advice I can give you is to go into each visualization and make sure the interval (under Panel Options) is set to >=1m and that Drop Last Bucket is set to yes. I would also set the dashboard to the last 1 hour and then save the time range with the dashboard, and set the refresh to a higher value than your collection interval. So if you are collecting metrics every 10 seconds, I would set the dashboard refresh to 30s.

If you make the changes above and it still doesn't help, then I would try increasing the interval to show data for the last 5 minutes (>=5m), or tweak that to something reasonable for your setup. The issue you're seeing revolves around the timing between the delivery of the data and the querying of the data.

Thanks again. I've tried those things but still have the same outcome. In terms of the last paragraph, did you mean the Interval under Panel Options? If I set it to 5m, nothing seems to change - it's still intermittent.

So to be clear, if I look at the "Last 1 hour", set "Drop Last Bucket=yes", and interval to "5m" under Panel Options, this still happens.

I've checked and the latency of metricbeat and filebeat data from the (kubernetes) cluster seems to be under 2 seconds. Is this definitely what's going on? Anything else I can try?
