Calculating rolling average using aggregations


(Simon Cast) #1

Hi,

I'm looking at using the new aggregations module of Elasticsearch to
produce some metrics for my application.

I want to calculate the rolling average for every hour in the last 28 days
and then plot these on a chart. I can see how this could be done using a
query per hour for the last 28 days. This seems problematic in terms of
load and performance. Can this be done using a single query?

Regards,

Simon

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/d0a2dba7-f5a5-4944-ae0e-c6771b43ad6c%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Clinton Gormley) #2

Yes, easily. Aggregations are really powerful. Here's an example:

First insert some data

curl -XPUT "http://localhost:9200/myindex/mytype/1" -d'
{
"created": "2014/03/10 12:05:00",
"somefield": 10
}'

curl -XPUT "http://localhost:9200/myindex/mytype/2" -d'
{
"created": "2014/03/10 12:05:00",
"somefield": 5
}'

curl -XPUT "http://localhost:9200/myindex/mytype/3" -d'
{
"created": "2014/03/9 12:05:00",
"somefield": 5
}'

Then return all results in the last 28 days

and calculate the avg per hour

curl -XGET "http://localhost:9200/myindex/mytype/_search" -d'
{
"query": {
"range": {
"created": {
"gte": "now-28d/d",
"to": "now"
}
}
},
"aggs": {
"per_hour": {
"date_histogram": {
"field": "created",
"interval": "hour"
},
"aggs": {
"rolling_avg": {
"avg": {
"field": "somefield"
}
}
}
}
}
}'

On 11 March 2014 09:26, Simon Cast simon.cast@gmail.com wrote:

Hi,

I'm looking at using the new aggregations module of Elasticsearch to
produce some metrics for my application.

I want to calculate the rolling average for every hour in the last 28 days
and then plot these on a chart. I can see how this could be done using a
query per hour for the last 28 days. This seems problematic in terms of
load and performance. Can this be done using a single query?

Regards,

Simon

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/d0a2dba7-f5a5-4944-ae0e-c6771b43ad6c%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/d0a2dba7-f5a5-4944-ae0e-c6771b43ad6c%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPt3XKS-tpSZMwM8WVBuXg%3Dez618xbTnCSTXAkWGrFsSxmdnew%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Simon Cast) #3

Thanks Clinton.

That looks interesting but that creates an average of values per hour for
the last 28 days. What I am looking for more precisely is to calculate a 28
rolling (or moving average) using the last 28 days of data and redoing that
calculation every hour.

I suppose what could be done is to do an average on a field that is a
hourly count.

On Tuesday, March 11, 2014 11:35:40 AM UTC, Clinton Gormley wrote:

Yes, easily. Aggregations are really powerful. Here's an example:

First insert some data

curl -XPUT "http://localhost:9200/myindex/mytype/1" -d'
{
"created": "2014/03/10 12:05:00",
"somefield": 10
}'

curl -XPUT "http://localhost:9200/myindex/mytype/2" -d'
{
"created": "2014/03/10 12:05:00",
"somefield": 5
}'

curl -XPUT "http://localhost:9200/myindex/mytype/3" -d'
{
"created": "2014/03/9 12:05:00",
"somefield": 5
}'

Then return all results in the last 28 days

and calculate the avg per hour

curl -XGET "http://localhost:9200/myindex/mytype/_search" -d'
{
"query": {
"range": {
"created": {
"gte": "now-28d/d",
"to": "now"
}
}
},
"aggs": {
"per_hour": {
"date_histogram": {
"field": "created",
"interval": "hour"
},
"aggs": {
"rolling_avg": {
"avg": {
"field": "somefield"
}
}
}
}
}
}'

On 11 March 2014 09:26, Simon Cast <simon...@gmail.com <javascript:>>wrote:

Hi,

I'm looking at using the new aggregations module of Elasticsearch to
produce some metrics for my application.

I want to calculate the rolling average for every hour in the last 28
days and then plot these on a chart. I can see how this could be done using
a query per hour for the last 28 days. This seems problematic in terms of
load and performance. Can this be done using a single query?

Regards,

Simon

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com <javascript:>.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/d0a2dba7-f5a5-4944-ae0e-c6771b43ad6c%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/d0a2dba7-f5a5-4944-ae0e-c6771b43ad6c%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/15324eca-2048-4a57-abe4-f1730e1bf70e%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Clinton Gormley) #4

Forgive my ignorance but could you provide a worked example of what you
mean? I'm not getting exactly what you're after.

On 12 March 2014 13:24, Simon Cast simon.cast@gmail.com wrote:

Thanks Clinton.

That looks interesting but that creates an average of values per hour for
the last 28 days. What I am looking for more precisely is to calculate a 28
rolling (or moving average) using the last 28 days of data and redoing that
calculation every hour.

I suppose what could be done is to do an average on a field that is a
hourly count.

On Tuesday, March 11, 2014 11:35:40 AM UTC, Clinton Gormley wrote:

Yes, easily. Aggregations are really powerful. Here's an example:

First insert some data

curl -XPUT "http://localhost:9200/myindex/mytype/1" -d'
{
"created": "2014/03/10 12:05:00",
"somefield": 10
}'

curl -XPUT "http://localhost:9200/myindex/mytype/2" -d'
{
"created": "2014/03/10 12:05:00",
"somefield": 5
}'

curl -XPUT "http://localhost:9200/myindex/mytype/3" -d'
{
"created": "2014/03/9 12:05:00",
"somefield": 5
}'

Then return all results in the last 28 days

and calculate the avg per hour

curl -XGET "http://localhost:9200/myindex/mytype/_search" -d'
{
"query": {
"range": {
"created": {
"gte": "now-28d/d",
"to": "now"
}
}
},
"aggs": {
"per_hour": {
"date_histogram": {
"field": "created",
"interval": "hour"
},
"aggs": {
"rolling_avg": {
"avg": {
"field": "somefield"
}
}
}
}
}
}'

On 11 March 2014 09:26, Simon Cast simon...@gmail.com wrote:

Hi,

I'm looking at using the new aggregations module of Elasticsearch to
produce some metrics for my application.

I want to calculate the rolling average for every hour in the last 28
days and then plot these on a chart. I can see how this could be done using
a query per hour for the last 28 days. This seems problematic in terms of
load and performance. Can this be done using a single query?

Regards,

Simon

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/
msgid/elasticsearch/d0a2dba7-f5a5-4944-ae0e-c6771b43ad6c%
40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/d0a2dba7-f5a5-4944-ae0e-c6771b43ad6c%40googlegroups.com?utm_medium=email&utm_source=footer
.
For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/15324eca-2048-4a57-abe4-f1730e1bf70e%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/15324eca-2048-4a57-abe4-f1730e1bf70e%40googlegroups.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPt3XKRFsY_F-dJdp2vN_yLOtHZ1-brdE%2BvxN7ATdapf4rnOow%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Binh Ly-2) #5

Unfortunately a MA aggregation is not available at the moment (unless you
accumulate/include all the prior data that you need into every doc, or you
run multiple queries and aggregate it yourself - neither of which is
desirable or flexible)

Clint, a MA is an average computed in a "sliding window" fashion, where the
time interval slides as the average is computed. So for example, let's say
I have a daily statistic for 5 days:

[Day1 = 100, Day 2 = 105, Day 3 = 101, Day 4 = 120, Day 5 = 200]

A 2 day in-the-past MA would be computed as follows:

Day1 = No data yet (since we don't have 2 days in-the-past worth of info)
Day2 = (Day1 + Day2) / 2 = (100 + 105) / 2
Day3 = (Day2 + Day3) / 2 = (105 + 101) / 2
...

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/89d909af-377e-44bd-8cec-31249f9f1d60%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Clinton Gormley) #6

Heya Bihn

The part I'm not getting is this: "the rolling average for every hour in
the last 28 days". ie what period should each bucket/rolling avg cover? an
hour? 28 days?

You can still do rolling averages with aggregations, but they require a bit
more work. I wanted to get the exact specs before trying to answer again :slight_smile:

clint

On 12 March 2014 16:09, Binh Ly binhly_es@yahoo.com wrote:

Unfortunately a MA aggregation is not available at the moment (unless you
accumulate/include all the prior data that you need into every doc, or you
run multiple queries and aggregate it yourself - neither of which is
desirable or flexible)

Clint, a MA is an average computed in a "sliding window" fashion, where
the time interval slides as the average is computed. So for example, let's
say I have a daily statistic for 5 days:

[Day1 = 100, Day 2 = 105, Day 3 = 101, Day 4 = 120, Day 5 = 200]

A 2 day in-the-past MA would be computed as follows:

Day1 = No data yet (since we don't have 2 days in-the-past worth of info)
Day2 = (Day1 + Day2) / 2 = (100 + 105) / 2
Day3 = (Day2 + Day3) / 2 = (105 + 101) / 2
...

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/89d909af-377e-44bd-8cec-31249f9f1d60%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/89d909af-377e-44bd-8cec-31249f9f1d60%40googlegroups.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPt3XKSkaL59uXBw6b9b_0r_%2BJHVPR-WL2P0JuXox90xCJDh%3Dg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Clinton Gormley) #7

OK, Binh took me aside and explained what you are after: a number per hour
which reflects the average of the previous 28*24 hours.

There are two ways you can do this with aggregations. The first way is to
create overlapping buckets, so each date will fall into multiple windows,
eg:

Jan 1, 10 am to Jan 28, 10am
Jan 1, 11 am to Jan 28, 11am
Jan 1, 12 am to Jan 28, 12am
etc

... which can be done by constructing all of the date ranges you need, ie
28 * 24 of them.

The second way to do it is to use a script to convert each date into
multiple dates:
$date
$date - 1h
$date - 2h
$date - 3h
...
$date - 672h

I've put together a simplified example (rolling avg per day) which
demonstrates both techniques. Unfortunately, the script version showed up
a bug in aggregations, which will be fixed in 1.1.0 - currently you can
include 4 values, but no more.

See the demo here: https://gist.github.com/clintongormley/9515005

clint

On 12 March 2014 16:29, Clinton Gormley clint@traveljury.com wrote:

Heya Bihn

The part I'm not getting is this: "the rolling average for every hour in
the last 28 days". ie what period should each bucket/rolling avg cover?
an hour? 28 days?

You can still do rolling averages with aggregations, but they require a
bit more work. I wanted to get the exact specs before trying to answer
again :slight_smile:

clint

On 12 March 2014 16:09, Binh Ly binhly_es@yahoo.com wrote:

Unfortunately a MA aggregation is not available at the moment (unless you
accumulate/include all the prior data that you need into every doc, or you
run multiple queries and aggregate it yourself - neither of which is
desirable or flexible)

Clint, a MA is an average computed in a "sliding window" fashion, where
the time interval slides as the average is computed. So for example, let's
say I have a daily statistic for 5 days:

[Day1 = 100, Day 2 = 105, Day 3 = 101, Day 4 = 120, Day 5 = 200]

A 2 day in-the-past MA would be computed as follows:

Day1 = No data yet (since we don't have 2 days in-the-past worth of info)
Day2 = (Day1 + Day2) / 2 = (100 + 105) / 2
Day3 = (Day2 + Day3) / 2 = (105 + 101) / 2
...

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/89d909af-377e-44bd-8cec-31249f9f1d60%40googlegroups.comhttps://groups.google.com/d/msgid/elasticsearch/89d909af-377e-44bd-8cec-31249f9f1d60%40googlegroups.com?utm_medium=email&utm_source=footer
.

For more options, visit https://groups.google.com/d/optout.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPt3XKQ3ZGG9Ak4AWUgsk%3DYMkQez9%3DLWdixzosRbDQS5ouqUiQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Clinton Gormley) #8

I rethought this problem last night. The solutions I've presented already
are a lot less efficient than they could be, as they increase the work per
doc by a factor of the number of buckets (ie 24h * 28d = 672).

It'd be much more efficient to calculate this rolling average client side
in a single pass over the data, eg:

curl -XGET "http://localhost:9200/myindex/_search?size=0" -d'
{
"aggs": {
"per_day": {
"date_histogram": {
"field": "date",
"interval": "day",
"format": "yyyy-MM-dd"
},
"aggs": {
"total": {
"sum": {
"field": "num"
}
}
}
}
}
}'

This gives you the doc count per interval, plus the total of num per
interval. Now, it is easy to calculate the rolling average in a single pass
across each slot:

  1. Add up the doc counts and num totals for all values in the desired
    window.
  2. Calc the average for the window
  3. End if there are no more docs
  4. Remove the oldest doc from total doc count and total num
  5. Add the next doc into total doc count and total num
  6. Repeat from step 2

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAPt3XKSuE5M3reGFnWBY0EytVJUb2omkn1-XtsD9yrKutXtDBQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


(Michaƫl Gallego) #9

+1 for this feature!

What I need is pretty similar: calculate rolling sum, so for each day, I
need to sum the previous 30 days (on each point). Oracle and Postgre make
this very easy with aggregation function (and they can take advantage of
very interesting optimization for sum, as each point is actually the sum of
previous day + sum of elements in the given day). Actually, the simplest
way I've found is asking for more data (so if I want rolling sum for 30
days, I ask for 60 days, and do the sum myself). But this is quite
inefficient.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b72eea67-d8cb-4ac3-ac4f-4e984183731b%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Michael Sander) #10

Bumping this thread. I'm looking for a similar aggregation. I'd like to create a rolling cumulative average, meaning, for each day, calculate the average over all previous days. I'd like to plot how the cumulative average changes over time. This is different than a windowed average because the window gets longer as time goes on.

From what I've read above, it doesn't look like this can be done with a single aggregation. It looks like it requires an aggregation for each day. But perhaps some new features were added in the past two years that makes this possible. Any ideas?


(system) #11