Hi everyone,
We are using collectd to gather NIC stats from multiple hosts and ship them to our ES cluster using logstash. The collected data looks as follows:
{"@timestamp":"2016-05-25T10:00:30.000+0100", "host":"host1", "collectd_type":"if_octets", "rx": 1000}
{"@timestamp":"2016-05-25T10:00:30.000+0100", "host":"host2", "collectd_type":"if_octets", "rx": 2000}
{"@timestamp":"2016-05-25T10:00:40.000+0100", "host":"host1", "collectd_type":"if_octets", "rx": 1000}
Data is received every 10 seconds, one record per host. The rx field is an ever-increasing counter representing the total number of bytes received by the NIC since the machine booted. Nothing new here for those familiar with collectd...
We'd like to draw a graph representing the bandwidth (bytes per second) used per host over time. For this we use a date histogram aggregation followed by a derivative to compute the number of bytes received during the bucket period. The aggs part of the query looks as follows (only the relevant parts are included):
    "aggs": {
      "4": {
        "terms": { "field": "host" },
        "aggs": {
          "2": {
            "date_histogram": {
              "interval": "10s",
              "field": "@timestamp"
            },
            "aggs": {
              "1": { "max": { "field": "rx" } },
              "3": { "derivative": { "buckets_path": "1", "unit": "second" } }
            }
          }
        }
      }
    }
As I understand, here is what happens:
- the terms aggregation creates a bucket for each host with matching documents
- the date_histogram aggregation further splits each host bucket into smaller buckets of 10s duration based on the @timestamp field
- for each 10s period (and for each host), the max function retains only the maximum rx value found in the time range. Since the value is an increasing counter, we could have used the min as well...
- finally, the derivative returns the difference of the max value between two consecutive buckets, which is the number of bytes received during that period.
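To make my understanding concrete, here is a rough Python sketch of the same computation. It is not what Elasticsearch runs internally, only the arithmetic as I picture it. The sample values and the per_host_rates helper are made up for illustration, and I assume the derivative is scaled by the time between consecutive buckets to get bytes per second.

```python
from collections import defaultdict

INTERVAL = 10  # bucket width in seconds, mirroring the "10s" date_histogram

# Made-up samples (not real data): (epoch seconds, host, rx counter in bytes)
samples = [
    (0, "host1", 1000), (0, "host2", 2000),
    (10, "host1", 1500), (10, "host2", 2200),
    (20, "host1", 2500), (20, "host2", 2300),
]

def per_host_rates(samples, interval=INTERVAL):
    """Hypothetical helper imitating terms -> date_histogram -> max -> derivative."""
    # terms + date_histogram + max: highest counter seen per (host, bucket)
    maxima = defaultdict(dict)
    for ts, host, rx in samples:
        bucket = ts - ts % interval
        maxima[host][bucket] = max(rx, maxima[host].get(bucket, rx))

    # derivative: difference between consecutive buckets, scaled to bytes/second
    rates = {}
    for host, buckets in maxima.items():
        keys = sorted(buckets)
        rates[host] = {
            cur: (buckets[cur] - buckets[prev]) / (cur - prev)
            for prev, cur in zip(keys, keys[1:])
        }
    return rates

rates = per_host_rates(samples)
print(rates)
```

With these numbers host1 comes out at 50 and 100 bytes/s for the second and third buckets, and host2 at 20 and 10 bytes/s. The first bucket has no rate because there is nothing before it to diff against.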
(stop me here if I'm wrong...)
So far so good. That query gives a breakdown of the used bandwidth per host. Now the question: how should I write the query to get the total bandwidth of all hosts?
Removing the terms aggregation on the host field is not correct, as it would break the max and derivative portion responsible for converting the counter value into a rate. As far as I understand, I should be able to build something like:
- create buckets from the date_histogram
- split each bucket per host
- for each host, find the maximum value (or the last based on timestamp)
- compute the derivative for each host, per bucket, to get the number of bytes for the time period
- then sum the total bytes of each host
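In plain Python, on made-up numbers, the result I am after would look roughly like the sketch below. The total_rate and naive_rate helpers are hypothetical names of mine, not Elasticsearch features. I also assume a host that is missing from one of two consecutive buckets is simply skipped, and that counters never reset.

```python
from collections import defaultdict

INTERVAL = 10  # bucket width in seconds, mirroring the "10s" date_histogram

# Made-up samples (not real data): (epoch seconds, host, rx counter in bytes)
samples = [
    (0, "host1", 1000), (0, "host2", 2000),
    (10, "host1", 1500), (10, "host2", 2200),
    (20, "host1", 2500), (20, "host2", 2300),
]

def bucket_of(ts, interval=INTERVAL):
    # Start of the time bucket that contains ts
    return ts - ts % interval

def total_rate(samples):
    """What I want: per-host derivative first, then the sum over hosts."""
    # date_histogram, split per host, keep the max counter of each host
    maxima = defaultdict(dict)  # bucket -> host -> max rx
    for ts, host, rx in samples:
        b = bucket_of(ts)
        maxima[b][host] = max(rx, maxima[b].get(host, rx))

    keys = sorted(maxima)
    totals = {}
    for prev, cur in zip(keys, keys[1:]):
        # Assumption: only hosts present in both buckets contribute
        common = maxima[prev].keys() & maxima[cur].keys()
        delta = sum(maxima[cur][h] - maxima[prev][h] for h in common)
        totals[cur] = delta / (cur - prev)  # bytes per second
    return totals

def naive_rate(samples):
    """What dropping the host split does: one max over all hosts per bucket."""
    maxima = {}
    for ts, _host, rx in samples:
        b = bucket_of(ts)
        maxima[b] = max(rx, maxima.get(b, rx))
    keys = sorted(maxima)
    return {
        cur: (maxima[cur] - maxima[prev]) / (cur - prev)
        for prev, cur in zip(keys, keys[1:])
    }

total = total_rate(samples)
naive = naive_rate(samples)
print(total)
print(naive)
```

With these numbers the total is 70 and 110 bytes/s for the second and third buckets, while the version without the host split gives 20 and 30 bytes/s, because the max only tracks whichever host happens to have the largest counter.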
Unfortunately, I have been unable to construct such a query.
Any help would be appreciated.
Thanks