We plan to store lots of SNMP Counter64 values (basically 64-bit increasing counters) in Elasticsearch. This data is collected from our network equipment; currently we are thinking of polling at 1-minute intervals. When presenting these counters we are obviously more interested in showing the derivative, i.e. the rate of change. ES 2.0 with the derivative pipeline aggregation helps a lot in this regard.
For querying this time series data, we'd use the histogram bucket aggregation, then the Max metric aggregation to return the highest value in each bucket, and then the derivative pipeline aggregation to get the derivative.
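For illustration, the query described above might look roughly like the sketch below. It only builds the search body as a Python dict and sends nothing. The field names (`@timestamp`, `ifHCInOctets`) and the aggregation names are assumptions, not taken from the actual setup, and the `interval` key follows the ES 2.x `date_histogram` syntax (later versions use different interval parameters).

```python
import json


def build_rate_query(counter_field, time_field="@timestamp", interval="1m"):
    """Build a search body: date_histogram -> max -> derivative.

    All field and aggregation names here are hypothetical.
    """
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "per_interval": {
                "date_histogram": {"field": time_field, "interval": interval},
                "aggs": {
                    # highest counter value seen in each bucket
                    "max_counter": {"max": {"field": counter_field}},
                    # difference between consecutive bucket maxima
                    "rate": {"derivative": {"buckets_path": "max_counter"}},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_rate_query("ifHCInOctets", interval="6h"), indent=2))
```

Note that the derivative here is the difference between bucket maxima, so it would still need to be divided by the bucket length to get a per-second rate.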
This works fine for monotonically increasing counters, if one neglects the presence of overflows and resets.
However, we are worried about overflows and resets. An overflow is when the 64-bit counter wraps around, starting again at 0 and incrementing from there. A reset is when the counter is manually reset, usually via the CLI on the networking equipment.
Theoretically, both resets and overflows only impact the derivative calculation of one sample interval (1 minute in our case). However, the Max metric aggregation returns the wrong value for any bucket that contains an overflow or reset, so the error affects a much larger timespan (for 1-year graphs the bucket interval may be six hours).
The derivative calculation is affected similarly: the overflow or reset may just as well happen on the border between two histogram buckets.
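To make the effect concrete, here is a small self-contained simulation of the max-then-derivative approach, done in plain Python rather than in Elasticsearch. The sample data is made up: a counter growing by 100 per minute that is reset to 0 at t=300 seconds, grouped into 3-minute buckets.

```python
def bucket_max_derivative(samples, bucket_size):
    """Mimic histogram -> max -> derivative on (timestamp, value) samples.

    Returns a list of (bucket_start, bucket_max, derivative_or_None).
    """
    buckets = {}
    for ts, value in samples:
        start = ts - ts % bucket_size  # start of the bucket this sample falls in
        buckets[start] = max(buckets.get(start, value), value)

    result = []
    prev = None
    for start in sorted(buckets):
        cur = buckets[start]
        result.append((start, cur, None if prev is None else cur - prev))
        prev = cur
    return result


# Made-up data: +100 per minute, counter reset to 0 at t=300
samples = [(t, 1000 + 100 * (t // 60)) for t in range(0, 300, 60)]
samples += [(t, 100 * ((t - 300) // 60)) for t in range(300, 720, 60)]

if __name__ == "__main__":
    for row in bucket_max_derivative(samples, 180):
        print(row)
```

With this data the true increase is 300 per bucket, but the bucket containing the reset reports 200 and the following bucket reports -1100. A single bad sample interval ends up distorting two whole buckets.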
Are there any good war stories about how to manage counter values in Elasticsearch?
(On a side note: yes, we only store 63 bits of the counter in Elasticsearch due to the lack of an unsigned 64-bit integer type - this effectively doubles the number of overflows.)
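One way the 63-bit truncation could be done at ingest time is a simple bit mask, as in the sketch below. This is an assumption about the mechanism (the post does not say how the truncation is performed), and the function name is made up.

```python
MASK_63 = (1 << 63) - 1  # largest value that fits in a signed 64-bit long


def to_signed_long_range(counter64):
    """Drop the top bit of an unsigned 64-bit counter so it fits a signed long."""
    return counter64 & MASK_63
```

The stored value then wraps at 2**63 instead of 2**64, which is why overflows happen twice as often.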
Did you ever figure out how to handle this? I am building basically the same setup, and although I have not yet aggregated the data, the overflow is going to be a large hurdle. I'm curious how you resolved this. Thanks!
I think we ignored the problem for now. In practice, for our use case, this turns out not to be a big issue. The few errors we get on graphs due to resets and overflows are OK for us.
However, we are no longer convinced Elasticsearch was as good a choice as we believed. Six months down the road we have been battling too many issues we didn't expect. A purer time series database may be better suited for counter values. RRDtool did many things right (including this problem), and that may be worth considering.
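As a rough sketch of the kind of handling meant here: a tool like RRDtool works on the difference between consecutive samples and compensates for wrap-around, rather than comparing bucket maxima. The function below is a simplified illustration of that idea and not RRDtool's actual code. The `bits` default of 63 matches the truncated storage mentioned earlier in the thread, and `max_delta` is a hypothetical plausibility limit used to tell a reset apart from a genuine wrap.

```python
def counter_delta(prev, cur, bits=63, max_delta=None):
    """Return the increase between two consecutive counter samples.

    A decrease is first assumed to be a wrap-around at 2**bits. If the
    resulting delta exceeds max_delta (when given), it is treated as a
    reset and None ("unknown") is returned instead.
    """
    if cur >= prev:
        delta = cur - prev
    else:
        delta = cur - prev + (1 << bits)  # assume exactly one wrap-around
    if max_delta is not None and delta > max_delta:
        return None  # implausibly large: more likely a reset than a wrap
    return delta
```

Storing these per-sample deltas (or rates) instead of the raw counter would confine an overflow or reset to a single sample interval, at the cost of doing the calculation before ingest.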
I'm somewhat stuck on Elasticsearch / Logstash for the time being, as that is what our other sites are using. I am investigating RRDTool for the purpose of structuring the data and then passing it into my ELK stack. From what I'm reading it doesn't appear that you use Kibana.
Was your aggregation setup pretty straightforward? I'm looking to do the same thing, which doesn't appear to be easy to implement in Kibana. Furthermore, I'm not very savvy with aggregations yet, and it appears to be a substantial amount of work for such a simple end result. All that being said, I'll have to find a solution, but I am hesitant to spend all that time implementing aggregations without knowing that's the proper technique for creating gauges out of these SNMP counters.
I'm also curious if the issues you faced were specifically regarding network monitoring. That is the primary purpose of my monitoring stack. Any pitfalls to watch out for specifically? Thanks for your time and any knowledge you're willing to share.
So you want to add the counters to an RRDtool file, and then put them into Elasticsearch afterwards? Interesting idea.
No, we don't use Kibana. We wanted to use Grafana at first, but found it too difficult to secure in order to show graphs to customers.
WRT ingesting: we tried Filebeat, Logstash and Fluentd. In the end we built a Python program that reads a newline-delimited JSON file and ingests it directly into Elasticsearch. This gave us far more resilience in case of high load, a missing network or any other problem that might turn up. We really wanted to make sure we don't lose data and couldn't figure out how to get the other ingesters to work reliably. We have a Node.js program that collects the data via SNMP and writes it to the newline-delimited JSON file mentioned earlier.
We used the aggregation that Grafana provides as a starting point and adjusted the queries to our needs from there.
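A minimal sketch of the core of such an ingester is shown below: it turns newline-delimited JSON documents into a request body in the format the Elasticsearch `_bulk` endpoint expects (an action line followed by the document, each on its own line, with a trailing newline). This is not the actual program from this thread. The index name is made up, `doc_type` is only included because ES 2.x requires a type, and the HTTP request, retries and file handling that provide the resilience are left out.

```python
import json


def ndjson_to_bulk(lines, index, doc_type=None):
    """Convert NDJSON lines into a body for the Elasticsearch _bulk API."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the input file
        doc = json.loads(line)  # raises ValueError on malformed JSON
        meta = {"_index": index}
        if doc_type is not None:
            meta["_type"] = doc_type
        out.append(json.dumps({"index": meta}))  # action line
        out.append(json.dumps(doc))              # document source line
    # the bulk body must be terminated by a newline
    return "\n".join(out) + "\n" if out else ""
```

Only lines that were acknowledged by Elasticsearch should be marked as done in the input file, so that a failed request can simply be retried later.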
I'm starting to look for an alternative to RRDtool and was thinking of using Elasticsearch for this, but after reading your post I'm worried about it.
Now, nearly a year after your post, would you recommend Elasticsearch for this kind of work?
We have moved away from Elasticsearch again and are back to using RRDtool. The new solution is much (orders of magnitude) more performant...
But this is not Elasticsearch's fault, really. We were just too optimistic about what we could expect from Elasticsearch.
We have seen Elasticsearch read 22 GB to output little more than 1 MB - and that obviously takes a lot of time, especially when answering multiple such queries...
We are using Elasticsearch for search, though, and are very happy with it.