Easy way to insert top-level query aggregation details back into Elasticsearch


(Sandeep Takhar) #1

Hi.

Been following Zach's pretty cool work on getting the eBay anomaly detection algorithm working in Elasticsearch.

In his example he has three levels of grouping. I just want the top-level term and its ninetieth_surprise value, and I want to send that back into Elasticsearch; I'm looking for ideas. I've searched around but haven't found much.

I have the example working with my data; all I did was change three values to match my field names.

Here is Zach's example:

There are 5 "metrics" with 5 ninetieth_percentiles in the JSON (in my JSON I get two ninetieth percentiles for each top-level bucket for some reason). My JSON response has all the bottom-level buckets; I just want to extract the top-level metric name and its ninetieth percentile, which is part of the top-level bucket.

{
  "query": {
    "filtered": {
      "filter": {
        "range": {
          "hour": {
            "gte": "{{start}}",
            "lte": "{{end}}"
          }
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "metrics": {
      "terms": {
        "field": "metric",
        "size": 5
      },
      "aggs": {
        "queries": {
          "terms": {
            "field": "query",
            "size": 500
          },
          "aggs": {
            "series": {
              "date_histogram": {
                "field": "hour",
                "interval": "hour"
              },
              "aggs": {
                "avg": {
                  "avg": {
                    "field": "value"
                  }
                },
                "movavg": {
                  "moving_avg": {
                    "buckets_path": "avg",
                    "window": 24,
                    "model": "simple"
                  }
                },
                "surprise": {
                  "bucket_script": {
                    "buckets_path": {
                      "avg": "avg",
                      "movavg": "movavg"
                    },
                    "script": "(avg - movavg).abs()"
                  }
                }
              }
            },
            "largest_surprise": {
              "max_bucket": {
                "buckets_path": "series.surprise"
              }
            }
          }
        },
        "ninetieth_surprise": {
          "percentiles_bucket": {
            "buckets_path": "queries>largest_surprise",
            "percents": [90.0]
          }
        }
      }
    }
  }
}
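The heart of the pipeline above is the movavg/surprise pair: for each bucket, surprise is the absolute distance between that bucket's average and a simple moving average of the preceding window. A rough standalone sketch of that arithmetic in Python (Elasticsearch's exact windowing details may differ):

```python
from collections import deque

def surprise_series(values, window=24):
    """For each value, surprise = |value - simple moving average of the
    previous `window` values| (None until a moving average exists)."""
    history = deque(maxlen=window)
    surprises = []
    for v in values:
        if history:
            movavg = sum(history) / len(history)
            surprises.append(abs(v - movavg))
        else:
            surprises.append(None)  # no prior buckets yet
        history.append(v)
    return surprises
```

A steady series followed by a spike makes the idea concrete: `surprise_series([10, 10, 10, 50])` yields a large surprise only at the spike.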


(Sandeep Takhar) #2

One thing I'm doing is using filter_path. Zach actually mentioned it in his second article, and it took me a while to find. My top-level field name is different, but it looks like this:

pretty=true&human=false&flat_settings=true&filter_path=aggregations.agent_names.buckets.key,aggregations.agent_names.buckets.ninetieth_surprise.values

I think I'll just use Logstash to read the file this outputs and dump it into Elasticsearch... I already have a framework for doing exactly that, and I've dealt with JSON objects before.

I don't know whether it's a good way to do it, but I'll give it a try.
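In effect, that filter_path prunes the response down to just each bucket's key and its ninetieth_surprise values. A hypothetical Python helper doing the same pruning client-side, for comparison:

```python
def trim_response(resp):
    """Keep only what the filter_path above keeps: each bucket's
    key plus its ninetieth_surprise values; drop everything else."""
    buckets = resp["aggregations"]["agent_names"]["buckets"]
    return {"aggregations": {"agent_names": {"buckets": [
        {"key": b["key"],
         "ninetieth_surprise": {"values": b["ninetieth_surprise"]["values"]}}
        for b in buckets
    ]}}}
```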


(Sandeep Takhar) #3

Here is how I flattened and split the resulting output... now I'll just send it to Elasticsearch using the output plugin. Again, not sure it's the best way, but it works:

input {
  stdin { codec => json }
}

# trick to reparse the message when it is brought in from a text file
filter {
  grok {
    match => ["message", "%{GREEDYDATA:msg}"]
  }
  json {
    source => "msg"
  }
}

filter {
  mutate {
    rename => [
      "[aggregations][agent_names][buckets]", "buckets"
    ]
    remove_field => "aggregations"
  }
}

filter {
  split {
    field => "buckets"
  }
}

filter {
  mutate {
    rename => [
      "[buckets][key]", "agent_name",
      "[buckets][ninetieth_surprise][values][90.0]", "ninetieth_surprise"
    ]
    remove_field => "buckets"
  }
}

output {
  stdout { codec => rubydebug }
}
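For anyone not using Logstash, the same split-and-rename can be sketched in a few lines of Python (field names match the config above):

```python
def flatten(resp):
    """Do what the split + mutate filters above do: emit one flat
    event per bucket, renaming key -> agent_name and pulling the
    90th-percentile value up to ninetieth_surprise."""
    events = []
    for b in resp["aggregations"]["agent_names"]["buckets"]:
        events.append({
            "agent_name": b["key"],
            "ninetieth_surprise": b["ninetieth_surprise"]["values"]["90.0"],
        })
    return events
```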


(Sandeep Takhar) #4

Here is the data I get using filter_path, or at least part of it, that can be used with the above config, in case anyone is following. I'll create a cron job now and see how the data looks in Timelion. There are plenty of null values because it's a staging environment and it's not very busy at the moment.

{"aggregations":{"agent_names":{"buckets":[
{"key":"custmgtbill2srv1","ninetieth_surprise":{"values":{"90.0":0.00917166937529728}},"ninetieth_surprise":{"values":{"90.0":0.00917166937529728}}},
{"key":"entmarketoffersvc3","ninetieth_surprise":{"values":{"90.0":0.016666666666666666}},"ninetieth_surprise":{"values":{"90.0":0.016666666666666666}}},
{"key":"custmgtconsumersvc1","ninetieth_surprise":{"values":{"90.0":0.017316291670002933}},"ninetieth_surprise":{"values":{"90.0":0.017316291670002933}}},
{"key":"custmgtconsumersvc4","ninetieth_surprise":{"values":{"90.0":0.01162318909094097}},"ninetieth_surprise":{"values":{"90.0":0.01162318909094097}}},
{"key":"billpresentmentweb3","ninetieth_surprise":{"values":{"90.0":0.0}},"ninetieth_surprise":{"values":{"90.0":0.0}}},
{"key":"custmgtconsumerweb4","ninetieth_surprise":{"values":{"90.0":1.4005602240896359E-6}},"ninetieth_surprise":{"values":{"90.0":1.4005602240896359E-6}}},
{"key":"custmgtfulweb1","ninetieth_surprise":{"values":{"90.0":0.0}},"ninetieth_surprise":{"values":{"90.0":0.0}}},
{"key":"custmgtfulweb3","ninetieth_surprise":{"values":{"90.0":0.0}},"ninetieth_surprise":{"values":{"90.0":0.0}}},
{"key":"custmgtfulweb4","ninetieth_surprise":{"values":{"90.0":0.0}},"ninetieth_surprise":{"values":{"90.0":0.0}}}
]}}}
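The duplicated ninetieth_surprise key inside each bucket (the "two ninetieth percentiles" mentioned in the first post) is mostly harmless downstream: typical JSON parsers keep a single copy, and both copies carry the same value. Python's json module, for instance, keeps the last occurrence:

```python
import json

# Duplicate keys in a JSON object: json.loads keeps the last one seen.
bucket = json.loads(
    '{"key": "x", "ninetieth_surprise": {"values": {"90.0": 1.0}},'
    ' "ninetieth_surprise": {"values": {"90.0": 1.0}}}'
)
print(bucket["ninetieth_surprise"]["values"]["90.0"])  # → 1.0
```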


(Sandeep Takhar) #5

I'll also post the query body from the curl command I run in a cron job, in case someone runs into this post, to round out the turn-key solution (?)

{
  "query": {
    "filtered": {
      "filter": {
        "and": [
          {
            "range": {
              "@timestamp": {
                "gte": "now-1h",
                "lte": "now"
              }
            }
          },
          {
            "terms": { "metric_name": ["durationmean"] }
          }
        ]
      }
    }
  },
  "size": 0,
  "aggs": {
    "agent_names": {
      "terms": {
        "field": "agent_name",
        "size": 5000
      },
      "aggs": {
        "metric_names": {
          "terms": {
            "field": "metric_name",
            "size": 10000
          },
          "aggs": {
            "series": {
              "date_histogram": {
                "field": "@timestamp",
                "interval": "minute"
              },
              "aggs": {
                "avg": {
                  "avg": {
                    "field": "metric_value"
                  }
                },
                "movavg": {
                  "moving_avg": {
                    "buckets_path": "avg",
                    "window": 60,
                    "model": "simple"
                  }
                },
                "surprise": {
                  "bucket_script": {
                    "buckets_path": {
                      "avg": "avg",
                      "movavg": "movavg"
                    },
                    "script": "(avg - movavg).abs()"
                  }
                }
              }
            },
            "largest_surprise": {
              "max_bucket": {
                "buckets_path": "series.surprise"
              }
            }
          }
        },
        "ninetieth_surprise": {
          "percentiles_bucket": {
            "buckets_path": "metric_names>largest_surprise",
            "percents": [90.0]
          }
        }
      }
    }
  }
}
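To close the loop without Logstash, the flattened events could also be pushed straight back via the _bulk API. A minimal sketch; the index name and @timestamp field here are assumptions, and older Elasticsearch versions also expect a _type in the action line:

```python
import json
from datetime import datetime, timezone

def bulk_body(events, index="surprise-metrics"):
    """Build an Elasticsearch _bulk request body (NDJSON): one action
    line plus one document line per flattened event. `index` and the
    added @timestamp field are illustrative assumptions."""
    now = datetime.now(timezone.utc).isoformat()
    lines = []
    for e in events:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps({**e, "@timestamp": now}))
    return "\n".join(lines) + "\n"  # _bulk bodies must end with a newline
```

The returned string can then be POSTed to the cluster's _bulk endpoint from the same cron job.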


(system) #6