Bytes per second - Is it possible?

(Sjaak) #1

Hi there,

I've been fighting this for a couple of days but I can't figure it out.

I'm feeding netflow data into Elastic and obviously I want to create a nice bandwidth graph that shows utilization in bytes per second.

The problem is I can't get this to work. Reading https://github.com/elastic/kibana/issues/4646 it appears the proper logic is missing from Kibana/timelion.

In that thread it says timelion's scale_interval should do the trick but it doesn't.

That is what I have now, but looking at this in e.g. a 15 minute time frame while downloading a large test file results in big spikes every minute or so, because the netflow records for the test download only come in every minute or so.

What Kibana/timelion should be doing is sum all matching records over the period of one minute and divide it by 60.
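
To pin down the arithmetic described above, here is a minimal Python sketch. It is not Kibana or Timelion code; the record layout (pairs of timestamp in seconds and bytes) and the function name are assumptions made purely for illustration.

```python
from collections import defaultdict

def bytes_per_second(records, bucket_seconds=60):
    """Sum bytes per bucket, then divide by the bucket length in seconds."""
    totals = defaultdict(int)
    for timestamp, nbytes in records:
        bucket_start = (timestamp // bucket_seconds) * bucket_seconds
        totals[bucket_start] += nbytes
    return {start: total / bucket_seconds for start, total in sorted(totals.items())}

# Hypothetical flow records: (seconds since start, bytes reported)
records = [(5, 60_000), (42, 540_000), (61, 30_000), (119, 570_000)]
rates = bytes_per_second(records)
print(rates)  # {0: 10000.0, 60: 10000.0}
```

Every record that falls inside a minute contributes to that minute's total, regardless of when within the minute it arrived.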

Setting the interval to 1 minute and applying the code above appears to give the correct results and does away with the one minute spikes but this way creates too many buckets on long time periods so is unusable.

(Sjaak) #2

Attaching a screenshot to make things more clear.

The top screenshot uses a one minute interval and then divides by 60. The bottom screenshot uses scale_interval.

I want scale_interval to produce the same result as the top screenshot.

(Peter Pisljar) #3

actually, this is the default way charts are generated in visualize (not timelion, but the normal bar or line chart)

if you use a date_histogram agg in the line chart with a 1 minute interval (though i suggest leaving the interval on auto if you plan to look at different time ranges such as day, month or year) and show average out bytes as the metric, it should produce what you are looking for.

also with kibana 5.4 we are introducing two new things that will make this much more adjustable:

  • pipeline aggs, specifically average bucket agg will be able to help here quite a lot
  • time series visual builder .... with this one it should be a breeze to do this and much more.

regarding timelion ... you should use scale_interval(60s) i guess

(Christian Dahlqvist) #4

scale_interval() works on each interval, so if you zoom in and get an interval that is less than 1 minute, there will be empty buckets, as your records only arrive once per minute, and you get the spikes you are seeing. If you used a 1 minute interval in the bottom graph, like you are doing in the top one, I suspect you would get the same result.

(Sjaak) #5

Unfortunately that won't work for netflow streams, as there are multiple streams every minute (though the large ones only come in every minute or so). You just get the average of all those streams, which means my test shows something like 5MB per minute instead of around 900kbps, which was the actual transfer speed. It also creates a parabolic graph with values going from 0 to the one minute average and back to zero again.

[quote]also with kibana 5.4 we are introducing two new things that will make this much more adjustable:

  • pipeline aggs, specifically average bucket agg will be able to help here quite a lot
  • time series visual builder .... with this one it should be a breeze to do this and much more.[/quote]

That sounds like something that would make me very happy :slight_smile: Where can I find the ELK release dates? Is it near?

This doesn't work. I think there is something wrong with the auto interval logic, it appears that this is ignored when drawing graphs.

scale_interval(1s) gives the correct results if I set the interval to 1 minute. If I set it to auto I get the spikes. Changing scale_interval to 60s, 1m, 10m or whatever gives the exact same spikes.

I assume auto interval and scale_interval work together like this: scale_interval receives some kind of interval value from auto interval depending on the time frame (e.g. 1m for 15 minutes, 10m for 1 hour, 1h for 24 hours etc). It is then smart enough to figure out on its own that if the interval value is 10m and scale_interval is 1s, it needs to divide the SUM of 10 minutes worth of data by 600, and the resulting value is what needs to be displayed for those 10 minutes.

It looks like it's kinda doing the calculation but not drawing the graph over the auto interval period.

Please see what I wrote above. 1s, 1m, 10m or any other scale_interval doesn't change the graph. The only thing that changes the graph is by setting the interval to a hard value instead of auto which makes me think something isn't right with the auto interval value.

At least on shorter time frames. Looks like the cut off time for this issue is 2.5 hours. Once I select a time span over 2.5 hours it displays graphs correctly, below 2.5 hours I get the spikes.

(Peter Pisljar) #6

we don't have fixed release dates, but 5.4 should be released sometime in May.

that might actually be a bug, i will try to investigate further.

(Sjaak) #7

Around May should work for me. I plan on demoing ELK to management around that time and it would be best to have a fully functional graph, or at least know it's a bug that will be solved.

(Rashid Khan) #8

This comes up pretty often, so I'll dive into the nitty gritty on it...

Caller: Help! I only get data once per minute but I want to show a per second rate!
Me: Congratulations caller number 8! You're getting backstage passes to Saturday's "Sponges and Gum" show! Er, wait, different radio program. This is the data one right?

Ok, you're getting data once per minute, and you've dropped into Timelion and you've got this stupid looking chart.

It's calculating an interval of less than 1 minute--1 second in this case--because you're only looking at 15 minutes of data. So Timelion says "Ah, I can show you a nice high resolution chart." You only get data every minute, but Timelion is asking Elasticsearch for per-second data. So once every 60 seconds, there is a big number, and no documents in between. The sum:bytes_per_minute is 0, zip, zilch, nada, nothing for 59 out of 60 seconds, so timelion plots that big fat zero.
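
A small Python sketch of the effect described above, using made-up numbers. The `bucket_sums` helper and the sample events are hypothetical; the sketch only mimics summing documents into fixed-width buckets and is not how Elasticsearch implements a date histogram.

```python
def bucket_sums(events, span_seconds, interval_seconds):
    """Sum event values into fixed-width buckets; empty buckets stay at 0."""
    buckets = [0] * (span_seconds // interval_seconds)
    for timestamp, value in events:
        buckets[timestamp // interval_seconds] += value
    return buckets

# Hypothetical data: one report of 600 bytes at the end of each minute, for 3 minutes.
events = [(59, 600), (119, 600), (179, 600)]

per_second = bucket_sums(events, span_seconds=180, interval_seconds=1)
per_minute = bucket_sums(events, span_seconds=180, interval_seconds=60)

print(per_second.count(0))  # 177 buckets with no documents, plotted as zero
print(max(per_second))      # 600, the once-a-minute spike
print(per_minute)           # [600, 600, 600]
```

With 1 second buckets almost every bucket is empty, which is where the spiky chart comes from; with 1 minute buckets every bucket holds exactly one report.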

But as far as you're concerned, it isn't 0, it's just that you only reported at the end of the minute. We need to tell timelion "If there are no documents in the bucket, the number isn't 0, it's just not known yet, aka, null". The keyword in that statement: if. Timelion does if.

.es(_exists_:bytes_per_minute).if(eq, 0, null, .es(metric="sum:bytes_per_minute"))

This says: Ask Elasticsearch for a count of documents, over time, in which the bytes_per_minute field is present: _exists_:bytes_per_minute. If that count equals (eq) 0, set the bucket to null; otherwise, set the bucket to the value of .es(metric="sum:bytes_per_minute"), that is, our original expression from the first chart.
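
The per-bucket logic of that expression can be mimicked in a few lines of Python. This is only an illustration of the idea: `if_eq` is a hypothetical helper rather than Timelion's implementation, the series are invented, and `None` stands in for Timelion's null.

```python
def if_eq(condition_series, compare_to, then_value, else_series):
    """Per bucket: use then_value where the condition equals compare_to,
    otherwise take the value from else_series."""
    return [
        then_value if cond == compare_to else other
        for cond, other in zip(condition_series, else_series)
    ]

# Hypothetical buckets: document counts and summed bytes for the same time slots.
doc_counts = [0, 0, 1, 0, 1]
byte_sums = [0, 0, 600, 0, 0]

result = if_eq(doc_counts, 0, None, byte_sums)
print(result)  # [None, None, 600, None, 0]
```

Note the last bucket: a document was present and reported 0 bytes, so the 0 is kept as a real measurement, while buckets with no documents become unknown.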

Since there are now a bunch of null buckets, I'll need to set this chart to points() for the moment, as lines won't connect over null buckets...we'll get to that in a moment.

It might not look like it, but this is progress. The chart on the left doesn't have all the zero values of the chart on the right, but we need to connect the dots by filling in the null values. Timelion's .fit(average) could do this for us:

Oooo, it lines up nicely with our original. Oh, that looks good right? But it's not. It's actually very wrong. We've invented bytes that didn't happen! Look what happens when we apply .cusum() to total up all these bytes:

We can't just interpolate like that. Drawing those straight lines between the points is a lie, and not your usual little-line-chart lie; it's a big fat everything-is-totally-wrong-now lie. No, what we need is to distribute that total of bytes every 60 seconds, into the other 59 buckets. We can use .fit(scale) for that:

Now the chart on the left doesn't look as good, but looks aren't everything... and don't judge a book by its cover... and a bird in the hand. Hey, leave that bird alone. Where were we? Oh right, charts. The chart on the right reveals that what we've done is actually quite accurate. The teal line now represents the per-second data, while the red continues to be per minute. The cumulative sum of the post-fit() series now tracks the original quite well, and setting the interval drop down does a nice job of always showing us the data at the requested resolution, even if it has to correct it because you don't have frequent enough data.
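
The difference between interpolating and distributing can be checked with totals. The Python below is a rough sketch of the two ideas, not Timelion's actual fit code. In particular, spreading each report over the empty buckets leading up to it is an assumption about direction; Timelion's own fit(scale) may place the values differently.

```python
def fit_average(series):
    """Fill runs of None by linear interpolation between the known neighbours."""
    out = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            out[i] = series[left] + step * (i - left)
    return out

def fit_scale(series):
    """Spread each known value evenly over itself and the None buckets before it."""
    out = list(series)
    run_start = 0
    for i, value in enumerate(series):
        if value is not None:
            share = value / (i - run_start + 1)
            for j in range(run_start, i + 1):
                out[j] = share
            run_start = i + 1
    return out

def total(series):
    """Total of all known values, i.e. the last point of a cumulative sum."""
    return sum(v for v in series if v is not None)

# Hypothetical sparse series: a 600 byte report, then one every third bucket.
sparse = [600, None, None, 600, None, None, 600]

print(total(sparse))               # 1800
print(total(fit_average(sparse)))  # 4200.0, bytes that never happened
print(total(fit_scale(sparse)))    # 1800.0, the total is preserved
```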

Fit just right

But what if you want to see a bytes per second chart? No matter what interval is selected? We can now employ scale_interval(1s). Below we can see that the chart's y-axis extents and shape stay about the same as we change the interval. This is because we're scaling the chart to always represent a 1 second rate.
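
Why the y-axis stays put can be seen with a little arithmetic. The function below is a hypothetical stand-in that shares only its name with Timelion's scale_interval; it rescales per-bucket totals to a per-second rate, and the steady 1000 bytes/s traffic is invented for the example.

```python
def scale_interval(bucket_values, bucket_seconds, target_seconds=1):
    """Rescale per-bucket totals to a rate per target interval."""
    return [value * target_seconds / bucket_seconds for value in bucket_values]

# Hypothetical steady traffic of 1000 bytes/s over 10 minutes, bucketed three ways.
steady_rate = 1000
scaled = {}
for bucket_seconds in (10, 60, 600):
    bucket_count = 600 // bucket_seconds
    sums = [steady_rate * bucket_seconds] * bucket_count
    scaled[bucket_seconds] = scale_interval(sums, bucket_seconds)

print({width: values[0] for width, values in scaled.items()})
# {10: 1000.0, 60: 1000.0, 600: 1000.0}
```

The raw bucket sums differ by a factor of 60 between the 10 second and 10 minute views, but after scaling they all read 1000 bytes per second.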

tipping the scale_interval

So there you go. Now you know how to normalize sparse data, as well as bring it back to a per interval rate, and hopefully you can see how you might use .if(), .fit() and .scale_interval() in other problems too. If you need more ideas, check out this blog post about Timelion conditionals: Time series If-Then-Else with Timelion

(Sjaak) #9

Rashid, thank you for the extensive write up but unfortunately this doesn't solve my problem.

The problem isn't so much that there is no (null) data between intervals, there is data but the way it's graphed when the interval falls below 1m is wrong.

What happens in the case of the netflow data I'm collecting:

  1. I start a large download.
  2. Netflow sends the records for this large download every minute or so (actual time depends on various factors).
  3. So for example every minute there is a record that has out_bytes 50MB.
  4. BUT, in the meantime other flows are still coming in, for example web browsing etc. These records might only show 50KB of usage.
  5. When the interval gets below 1m I get spikes. I did a quick test and at 30s intervals it gets really bad, up to 50s is OK'ish. This is with scale_interval(1s).
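
The steps above can be reproduced with a small simulation. All numbers here are invented to resemble the scenario (a 54 MB record once a minute for the download, 50 KB records every 10 seconds for background traffic), and `per_second_rates` is a hypothetical helper that sums each bucket and divides by its width, which is roughly what a sum metric followed by scale_interval(1s) amounts to.

```python
def per_second_rates(records, span_seconds, interval_seconds):
    """Sum bytes per bucket and divide by the bucket width in seconds."""
    sums = [0] * (span_seconds // interval_seconds)
    for timestamp, nbytes in records:
        sums[timestamp // interval_seconds] += nbytes
    return [total / interval_seconds for total in sums]

span = 300  # five minutes of hypothetical traffic
big_flow = [(t, 54_000_000) for t in range(59, span, 60)]  # download, reported once a minute
small_flows = [(t, 50_000) for t in range(0, span, 10)]    # background browsing
records = big_flow + small_flows

rates_30s = per_second_rates(records, span, 30)
rates_60s = per_second_rates(records, span, 60)

print(min(rates_30s), max(rates_30s))  # 5000.0 1805000.0
print(min(rates_60s), max(rates_60s))  # 905000.0 905000.0
```

Because the small flows keep every bucket non-empty, there are no null buckets to fill in; at 30 second intervals the chart simply alternates between a tiny value and a value twice the real rate.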

I'm not sure about the logic behind the auto interval or what is causing the problem but if I could somehow tell Kibana to NOT use intervals shorter than 1m it might solve my problem.

Is it possible to set a minimum auto interval?

edit: Similar to this. https://github.com/elastic/kibana/issues/3787

Looks like that feature is not implemented yet though.

(Rashid Khan) #10

Fair enough, here you go: https://github.com/elastic/kibana/pull/11476

(Rashid Khan) #11

Ok, I have another solution to this that doesn't require you to wait for 5.5, nor does it make you set a global setting, but it does require you to install a plugin, at least until I can get this merged into core. You'll need the timelion-extras plugin from here:

I added support for date math to the moving average function. So now you can do something like:

.es(metric=sum:bytes_per_minute).mvavg(1m) 

This accomplishes the same thing as that blog post, except it also works if the low buckets have data in them. In the case that the interval is smaller than 1m, this will average all the points over a minute. If the interval is over 1m, it won't do anything, the series will remain the same.
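
A rough Python sketch of that behaviour, under stated assumptions: the window is expressed in seconds instead of date math, it trails behind each point, and shorter windows are used at the start of the series. Timelion's own moving average may position the window and handle the leading points differently, so treat this as an illustration only.

```python
def mvavg(series, window_seconds, interval_seconds):
    """Trailing moving average spanning window_seconds.
    Leaves the series unchanged when one bucket already covers the window."""
    window = window_seconds // interval_seconds
    if window <= 1:
        return list(series)
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical 10 second buckets: a 600 byte report lands in every sixth bucket.
spiky = [0, 0, 0, 0, 0, 600] * 3

smoothed = mvavg(spiky, window_seconds=60, interval_seconds=10)
print(smoothed[5:8])  # [100.0, 100.0, 100.0]

# With 2 minute buckets the 1 minute window does nothing.
print(mvavg([1200, 1200], window_seconds=60, interval_seconds=120))  # [1200, 1200]
```

Once a full minute of buckets is in the window, each point reads 100 bytes per 10 second bucket, which matches 600 bytes spread over a minute, and small values in the in-between buckets would simply be averaged in.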

In action:

(Rashid Khan) #12

Here's a gif that illustrates the effect with a data set more similar to yours:

(Sjaak) #13

Awesome! Thank you so much :smiley:

Works like a charm. The only small thing I noticed is that if you zoom in under one minute the graph will always show 0 but that might make sense if there is no data for that period.

I'm testing with .mvavg(3m) now and this does away with any spikes or sudden drops.

(Rashid Khan) #14

Submitted a pull to upstream: https://github.com/elastic/kibana/pull/11555

(Rashid Khan) #15

Pull merged, it will be in 5.5. When 5.5 comes out I'll remove movingaverage() from timelion-extras

(Sjaak) #16

Thanks for the effort.

What command can we use after 5.5 comes out, or will it work automagically?

(Rashid Khan) #17

The command will be the same, but you'll need to update or remove timelion-extras so you get the native version

(Siamak Layeghy) #18

Hi Rashid,

Thanks for the explanations.
I have the exact same problem and I think neither solution solves the issue.
I am trying to visualize netflow data in ES and Kibana.
I have a field called IN_BYTES (input bytes) and I have used four different methods to visualize it, but there is still a problem. I have attached a screenshot to help:

The left top figure is generated using Visual Builder,

The left bottom figure is generated in Timelion using:
.es(metric='sum:IN_BYTES').multiply(8).divide(1048576).mvavg(1s).color(#0FFFF0).lines(width=1,fill=2).label('Input').title('Traffic [Mbps]')

The top right figure is generated in Timelion using:
.es(index=nprobe*, timefield=@timestamp, metric=sum:IN_BYTES).cusum().derivative().mvavg(5).multiply(8).divide(1048576).lines(fill=2,width=1).color(#00FF00).label("Input").title("Traffic [Mbps]")

The right bottom figure is created in Timelion using:
.es(metric='sum:IN_BYTES').multiply(8).divide(1048576).scale_interval(1s).fit(scale).lines(width=2,fill=1).color(#00FF00).label('input').title('Traffic [Mbps]')
As you can see, the Y-axis values are not consistent between the figures, while the scale-fitted figure (bottom right) shows values that are correct according to the interface.
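
For reference, the unit arithmetic behind multiply(8).divide(1048576) can be written out in Python. This sketch only illustrates the conversion and makes no claim about which figure is right; the helper names and byte counts are hypothetical. It shows that a bucket total converted to megabits is a per-bucket amount, and only becomes megabits per second once it is also divided by the bucket width.

```python
def mbit_per_bucket(bucket_bytes):
    """Bytes in a bucket -> megabits in that bucket (1 Mbit = 1048576 bits here)."""
    return bucket_bytes * 8 / 1048576

def mbit_per_second(bucket_bytes, bucket_seconds):
    """Bytes in a bucket -> average megabits per second over that bucket."""
    return mbit_per_bucket(bucket_bytes) / bucket_seconds

# Hypothetical steady traffic of 20 Mbit/s seen with 30 s and 60 s buckets.
bytes_in_30s = 78_643_200
bytes_in_60s = 157_286_400

print(mbit_per_bucket(bytes_in_30s), mbit_per_bucket(bytes_in_60s))          # 600.0 1200.0
print(mbit_per_second(bytes_in_30s, 30), mbit_per_second(bytes_in_60s, 60))  # 20.0 20.0
```

The per-bucket numbers change with the interval, while the per-second numbers do not, so expressions that differ in whether and how they scale to one second can be expected to show different Y-axis ranges.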

The problem arises when I zoom in, for instance to 6 minutes of data, as you can see in the image below:


The values go up several times over, which is not reasonable at all. If the interface speed was less than 200 Mbps, as seen in the selected area:

It cannot be more than 1000 Mbps in the same period, as seen:

I would appreciate it if you could help me.

Cheers,
Siamak

(Sjaak) #19

This is the query that I'm using and that in my testing gave me the correct results.

$src_query='host:1.1.1.1', .es($src_query,metric='sum:netflow.in_bytes').mvavg(3m).scale_interval(1s).divide(1024).label('Up - KBps').color('red'), 

Keep in mind that you might have to change mvavg depending on the device you're working with. I'm collecting netflow from a Fortigate and set that to export flows every 3 minutes. Setting mvavg(1m) would give me incorrect results.

From a quick test with pfSense, which I can't really remember, I believe I had mvavg(1m) for correct results.

I suggest that you limit your bandwidth to 1mbps or whatever and start a big download. Make sure it's downloading at a steady'ish speed, then set your graph to auto refresh and check which setting gives you correct results.
