We plan to store a lot of SNMP Counter64 values (essentially 64-bit increasing counters) in Elasticsearch. The data is collected from our network equipment; we are currently thinking of polling at 1-minute intervals. When presenting these counters we are mostly interested in the derivative, i.e. the rate of change. Elasticsearch 2.0, with its derivative pipeline aggregation, helps a lot in this regard.
To query this time-series data, we would use the histogram bucket aggregation, then the max metric aggregation to return the highest value in each bucket, and then the derivative pipeline aggregation to get the rate of change between buckets.
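To make the aggregation chain concrete, here is a minimal sketch of the request body, built as a Python dict. It assumes `date_histogram` is used as the histogram bucket aggregation; the field names `ifHCInOctets` and `@timestamp`, the aggregation names, and the helper `counter_rate_query` are all hypothetical placeholders, not something from our actual mapping.

```python
import json


def counter_rate_query(field="ifHCInOctets", time_field="@timestamp", interval="1h"):
    """Build an aggregation body: date_histogram -> max -> derivative.

    All field and aggregation names are illustrative assumptions.
    """
    return {
        "size": 0,  # only the aggregation results are needed, not the hits
        "aggs": {
            "per_bucket": {
                "date_histogram": {"field": time_field, "interval": interval},
                "aggs": {
                    # highest counter value seen inside each bucket
                    "counter_max": {"max": {"field": field}},
                    # difference between this bucket's max and the previous one
                    "counter_rate": {"derivative": {"buckets_path": "counter_max"}},
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(counter_rate_query(interval="6h"), indent=2))
```

The derivative here is a plain difference between neighbouring bucket maxima, so it still has to be divided by the bucket width to get a per-second rate.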
This works fine for monotonically increasing counters, if one neglects the presence of overflows and resets.
However, we are worried about exactly those overflows and resets. An overflow is when the 64-bit counter wraps around, starting again at 0 and incrementing from there. A reset is when the counter is manually reset, usually via the CLI of the networking equipment.
Theoretically, both resets and overflows only affect the derivative calculation for a single sample interval (1 minute in our case). However, the max metric aggregation returns the wrong value for any bucket that contains an overflow or reset, so the error spreads over a much larger time span (for 1-year graphs the bucket interval may be six hours).
The derivative calculation has the same problem: an overflow or reset is just as likely to fall on the border between two histogram buckets.
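A small simulation may illustrate the concern. This is only a sketch on made-up data (a counter that grows by 100 per sample and is reset once); the function names are hypothetical and the bucketing is done in plain Python rather than in Elasticsearch.

```python
def per_sample_deltas(samples):
    """Differences between consecutive raw samples."""
    return [b - a for a, b in zip(samples, samples[1:])]


def bucket_max_derivative(samples, bucket_size):
    """Take the max of each fixed-size bucket, then diff neighbouring maxima."""
    maxima = [
        max(samples[i:i + bucket_size])
        for i in range(0, len(samples), bucket_size)
    ]
    return [b - a for a, b in zip(maxima, maxima[1:])]


# Hypothetical data: +100 per sample, counter reset to 0 at sample index 7.
samples = [1000 + 100 * i for i in range(7)] + [100 * i for i in range(9)]

if __name__ == "__main__":
    # Raw deltas: exactly one value is wrong (the sample where the reset happened).
    print(per_sample_deltas(samples))
    # Bucketed (4 samples per bucket): a correct result would be 400 everywhere.
    print(bucket_max_derivative(samples, 4))
```

On this data only one of the 15 per-sample deltas is off, while two of the three bucket-level derivatives differ from the expected 400: the bucket containing the reset reports a max from before the reset, and the following bucket then produces a large negative value.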
Are there any good war stories about how to manage counter values in Elasticsearch?
(On a side note: yes, we only store 63 bits of the counter in Elasticsearch, due to the lack of an unsigned 64-bit integer type. This effectively doubles the number of overflows.)
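For reference, the 63-bit truncation and the matching wrap-around arithmetic could look roughly like the sketch below. The helper names `to_stored` and `wrapped_delta` are hypothetical, and the modulo trick assumes at most one wrap between two samples; it cannot tell a wrap apart from a manual reset.

```python
MASK_63 = (1 << 63) - 1  # largest value that fits in a signed 64-bit long


def to_stored(counter64):
    """Keep only the lower 63 bits of an unsigned 64-bit counter value."""
    return counter64 & MASK_63


def wrapped_delta(prev_stored, cur_stored):
    """Increase between two stored samples, assuming at most one 63-bit wrap."""
    return (cur_stored - prev_stored) % (1 << 63)


if __name__ == "__main__":
    before = to_stored(2**63 - 10)  # just below the 63-bit boundary
    after = to_stored(2**63 + 5)    # raw counter crossed 2**63, stored value wraps
    print(before, after, wrapped_delta(before, after))
```

Because the stored value wraps at 2**63 as well as when the raw counter wraps at 2**64, the stored series wraps twice per full cycle of the real counter, which is the doubling mentioned above.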