I want to compare the daily average of a metric (the frequency of words appearing in texts) to the value of a specific day. This is during a week. My goal is to check whether there's a spike. If the last day is way higher than the daily average, I'd trigger an alarm.
So from my input in Elasticsearch I compute the daily average during the week and find out the value for the last day of that week.
For getting the daily average for the week, I simply cut a week's worth of data using a range
query on date
field, so all my available data is the given week. I compute the sum and divide by 7 for a daily average.
For getting the last day's value, I did a terms aggregation on the date
field with descending order and size 1
The whole output is as follows. Here you can see words "rama0" and "rama1" with their corresponding frequencies.
{
"aggregations" : {
"the_keywords" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets" : [
{
"key" : "rama0",
"doc_count" : 4200,
"the_last_day" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3600,
"buckets" : [
{
"key" : 1580169600000,
"key_as_string" : "2020-01-28T00:00:00.000Z",
"doc_count" : 600,
"the_last_day_frequency" : {
"value" : 3000.0
}
}
]
},
"the_weekly_sum" : {
"value" : 21000.0
},
"the_daily_average" : {
"value" : 3000.0
}
},
{
"key" : "rama1",
"doc_count" : 4200,
"the_last_day" : {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 3600,
"buckets" : [
{
"key" : 1580169600000,
"key_as_string" : "2020-01-28T00:00:00.000Z",
"doc_count" : 600,
"the_last_day_frequency" : {
"value" : 3000.0
}
}
]
},
"the_weekly_sum" : {
"value" : 21000.0
},
"the_daily_average" : {
"value" : 3000.0
}
},
[...]
]
}
}
}
Now I have the_daily_average
in a high level of the output, and the_last_day_frequency
in the single-element buckets list in the_last_day
aggregation. I cannot use a bucket_script
to compare those, because I cannot refer to a single bucket (if I place the script outside the_last_day
aggregation) and I cannot refer to higher-level aggregations if I place the script inside the_last_day
.
IMO the reasonable thing to do would be to put the script outside the aggregation and use a buckets_path
using the <AGG_NAME><MULTIBUCKET_KEY>
syntax mentioned in the docs, but I have tried "var1": "the_last_day[1580169600000]>the_last_day_frequency"
and variations (hardcoding first until it works), but I haven't been able to refer to a particular bucket.
My ultimate goal is to have a list of keywords for which the last day frequency greatly exceeds the daily average.
For anyone interested, my current query is as follows. Notice that the part I'm struggling with is commented out.
body='{
"query": {
"range": {
"date": {
"gte": "START",
"lte": "END"
}
}
},
"aggs": {
"the_keywords": {
"terms": {
"field": "keyword",
"size": 100
},
"aggs": {
"the_weekly_sum": {
"sum": {
"field": "frequency"
}
},
"the_daily_average" : {
"bucket_script": {
"buckets_path": {
"weekly_sum": "the_weekly_sum"
},
"script": {
"inline": "return params.weekly_sum / 7"
}
}
},
"the_last_day": {
"terms": {
"field": "date",
"size": 1,
"order": {"_key": "desc"}
},
"aggs": {
"the_last_day_frequency": {
"sum": {
"field": "frequency"
}
}
}
}/*,
"the_spike": {
"bucket_script": {
"buckets_path": {
"last_day_frequency": "the_last_day>the_last_day_frequency",
"daily_average": "the_daily_average"
},
"script": {
"inline": "return last_day_frequency / daily_average"
}
}
}*/
}
}
}
}'