Cumulative Sum regardless of time interval


(Chris Stone) #1

I've created a line chart that plots the cumulative sum of unique terms over time. I'd like this to work with a daily time interval. What I expect is for the sum to be calculated from the values at the beginning of the line to its end, disregarding the daily time interval. Instead, I get the sum of terms unique only within each interval. Is there a way to get the behavior I'm expecting?

Thanks,

--Chris


Cumulative Sum Regardless of Time Interval - ES
(Lee Drengenberg) #2

Can you paste a screenshot of what you have and then use that to describe how you'd like to change it?

I did this, created some sample data (ignore the "words" and "id" fields. Just note that there are different values for "action" in each minute interval (you could paste this in the Dev Tool > Console;

POST discuss/test/
{
    "date" : "2017-08-15T00:01:00",
    "action" : "start",
    "myid" : 1,
    "words" : "these are words right here and in fact a long sentance cr /n break <BR>"
}

POST discuss/test/
{
    "date" : "2017-08-15T00:01:00",
    "action" : "stop",
    "myid" : 6,
    "words" : "these are words right here and in fact a long sentance cr /n break <BR> followed by more words"
}


POST discuss/test/
{
    "date" : "2017-08-15T00:02:00",
    "action" : "stop",
    "myid" : 1
}


POST discuss/test/
{
    "date" : "2017-08-15T00:03:00",
    "action" : "test",
    "myid" : 2
}


POST discuss/test/
{
    "date" : "2017-08-15T00:04:00",
    "action" : "aaaa",
    "myid" : 2
}

POST discuss/test/
{
    "date" : "2017-08-15T00:04:00",
    "action" : "bbbb",
    "myid" : 3
}

POST discuss/test/
{
    "date" : "2017-08-15T00:04:00",
    "action" : "cccc",
    "myid" : 2
}

POST discuss/test/
{
    "date" : "2017-08-15T00:04:00",
    "action" : "dddd",
    "myid" : 3
}

POST discuss/test/
{
    "date" : "2017-08-15T00:04:00",
    "action" : "eeee",
    "myid" : 4
}

I created an index pattern in Kibana for this discuss index.

Then I created a Line chart visualization with the metric being the unique count of action.keyword.

It looks like I would expect, but it sounds like you want it to show the sum of the unique counts added each internal?

If you use Timelion you could use this expression to show unique counts just like in the Line Chart Visualization;

.es(index=discuss, timefield=date, metric=cardinality:action.keyword)

And with Timelion I can add the cumulative sum function like this;

.es(index=discuss, timefield=date, metric=cardinality:action.keyword).cusum()

Which shows those counts adding up each interval;

Regards,
Lee


(Chris Stone) #3

Thanks Lee! Here's what I've got now:

My desire is to plot the cumulative count of unique IPs. The above shows as plotted by day, and the following is as plotted by hour. I would expect the final total to be the same, however since the data is calculated unique only within the specified time interval, they are not showing as equal.

I have not yet tried in Timelion what you've shown, but will now and report back on what I end up with.

Thanks again!


(Lee Drengenberg) #4

Ah, right, I just remembered the Pipeline Aggregations are in Kibana now and was going to post an example of using it. You beat me to it.

I think you'll have the same issue in Timelion.

If at one timestamp you have a, b, c, and in the next you have b, c, d, then if your interval is small enough that those timestamps are in different buckets you'll have a cumulative sum of 3 + 3 = 6. But in a larger interval they both fall into the same bucket and you get a, b, d, d = 4


(Chris Stone) #5

Thanks, so sounds like I'm out of luck then?


(Lee Drengenberg) #6

I'm thinking maybe you don't want to show the cumulative sum of unique IPs in each time interval.

I would think you want that last measurement on the right of your timescale to show the total unique count for the entire span?
For example, if in hour 1 you have 20 unique IP addresses, and in hour 2 you have 25, but 10 of the IPs are the same from hour 1 and hour 2, do you want the line to end at the right side at 35 (the total unique IPs), or 45 (the sum of the unique count during each hour?


(Chris Stone) #7

Exactly -- I want the total unique IPS over the entire span. Thanks for looking into this :slight_smile:


(Lee Drengenberg) #8

Now the hard part is figuring out how to do it. Or if it's possible.... I'll look at it again tomorrow.

Regards,
Lee


(Lee Drengenberg) #9

Sorry Chris, I've asked around and haven't found the solution. I'm not sure it's possible


(Chris Stone) #10

Thanks for trying, Lee. Wonder if this would be considered a bug since the behavior here isn't what would typically be expected?

--Chris


(Lee Drengenberg) #11

I don't think it would be a Kibana bug, since Kibana only uses queries it can build against Elasticsearch. And I don't know that there's any way to do a query like that against Elasticsearch. You could post a question on the Elasticsearch channel to see. And that could lead to an enhancement request issue.

The trick is to describe the cumulative unique count you're looking for. It's not the unique count of each time bucket (which we already tried). Instead of sequential time buckets, you need each time bucket to have the same starting time, but an increasing ending time. So each bucket is longer.

If the Elasticsearch team describes some kind of pipeline aggregation query that can do that, we could try again in Kibana to accomplish that. But it's beyond my query skill level at the moment.

Actually, now I'm thinking there could be a way. If we flip the order of things. Instead of finding the unique count of IPs within each bucket, start with the query of unique IPs in the whole timespan and then get the min timestamp for each of those. Not sure how to do that yet, but I'll try.


(Lee Drengenberg) #12

Here's as far as I got with help from a Kibana developer. In this Data Table visualization we Split Rows on a Terms aggregation on clientip and ordered by Min @timestamp.
And in the metrics used a Top Hit aggregation on @timestamp with size 1 (we only want the first timestamp of each IP).
But I couldn't do this same thing on a Line Chart, and I don't know if the Terms agg will scale to many thousands of IP addresses.
What you really want from this point is to get the count of those IPs into time buckets. I don't see any way in Kibana to do that.

I think we're back to asking the Elasticsearch team if it's really critical to you. If you find a query that gives the right results we might be able to use the Advanced JSON Input fields to get it to work in Kibana.


(Chris Stone) #13

Thanks for all your work with that. I'll give it a try and see if it'll work for me and keep you posted.

Thanks again,

--Chris


(Peng Chen) #14

Hi Chris,
Have you got this issue solved? I have the similar needs. It's up to how to define cumulative, for now seems the cusum function in Kibana takes the time range as a filter for calculation.


(Chris Stone) #15

Hi Peng,

No, I haven't. I've been meaning to contact the eleasticsearch team as suggested to see if it's something they could address. I'll try to get to that today and let you know. But are you suggesting that it might be a Kibana issue after all?

Thanks,

--Chris


(Chris Stone) #16

Elasticsearch topic:


(Peng Chen) #17

Hi Chris
I feel it's up to how people are defining cusum. For me it could be a cumulative number based on the whole time series. It equals to sum an auto date histogram of a certain metric from begging of the specified timefield until now and when date range filter is applied, it's just slice a piece of data from timeline instead of re-calculate it based on the date range filter. There might be needs for both cases, I just hope if kibana team could add a variable in cusum function to make both possible. Now, if this is not possible, I'm looking to see if it's possible to build custom function for timelion...


Visualize count of unique identities over time
(Peng Chen) #18

By looking at the cusum function code, it depends on the series gets passed in, and date range filter must have been applied to the series before it arrives cusum function, it could be quite difficult to achieve what we wish to have...


(Lee Drengenberg) #19

If you can get the list of min timestamps for each IP in the entire timespan,
then get the count of those in each time bucket and graph that.

(I don't think that would be using cumulative sum at all)


(system) #20

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.