We're using ES for IoT. A lot of data is stored in our cluster: a document is indexed every 3 seconds, and after a year we end up with tons of data that isn't that important anymore, so we're trying to figure out a way to aggregate the old data at a coarser granularity.
Example: month-old data should be aggregated by minute (using the average of the desired fields), and year-old data should be aggregated hourly. After the rollup (if I can call it that) happens, the old raw data should be deleted (or dumped to a file and then removed from the index).
In our current configuration we're using a pipeline that creates an index annually, and after studying the Rollup API, I don't think we can accomplish this:
a rollup on top of previous rollup data (monthly data should be transformed into yearly data; or do we have to keep the raw data, create a new yearly rollup from it, and then erase the previous monthly rollup?)
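For reference, something like this is roughly what I have in mind for the minute-level tier. This is only a sketch: the index pattern, field names and job id are placeholders, and the exact syntax differs between versions (on 6.x the endpoint is `_xpack/rollup/job/...` and the interval parameter is called `interval` rather than `fixed_interval`):

```
PUT _rollup/job/sensor_minute_rollup
{
  "index_pattern": "sensor-*",
  "rollup_index": "sensor_rollup_minute",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "1m",
      "delay": "1d"
    },
    "terms": {
      "fields": ["device_id"]
    }
  },
  "metrics": [
    { "field": "temperature", "metrics": ["avg"] }
  ]
}
```

The cron only controls how often the job wakes up to process new data; the granularity of the rolled-up documents comes from the date_histogram interval.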
I tried the new Rollup feature, and one thing that struck me is that the data you rolled up is not deleted, and there is no option to delete it.
So I guess the idea is to run your own DELETE or Delete By Query... but you have to make sure the rollup job has finished processing that data before doing so, so that you actually get all the benefits of the feature.
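As a sketch of what I mean (the index name, job id and timestamp field are just assumptions on my part, and on 6.x the rollup endpoints live under `_xpack/rollup/...`):

```
# Check the job's state and stats first, to make sure it has
# already processed the documents you are about to delete
GET _rollup/job/sensor_minute_rollup

# Then remove raw documents older than, say, 30 days
POST sensor-*/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-30d"
      }
    }
  }
}
```

The GET response includes the job's status and stats, which you can use to verify it has rolled up past the cutoff before deleting anything.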
This is correct: the Rollup API explicitly doesn't touch the original source data. We did it this way for a few reasons. Philosophically, we wanted Rollup just to add functionality and not deal with the intricacies of data retention. Technically, it simplified the feature a lot.
For now, users will need to use something like Curator or Delete/Delete-By-Query to manage data retention of the source data (if they wish). In the future this will be handled more simply with Index Lifecycle Management.
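For example, if the raw data is written to time-based indices, a Curator action file along these lines would handle the deletion (the index prefix, date pattern and retention period are placeholders):

```
actions:
  1:
    action: delete_indices
    description: "Delete raw sensor indices older than 30 days"
    options:
      ignore_empty_list: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: sensor-
    - filtertype: age
      source: name
      direction: older
      timestring: '%Y.%m.%d'
      unit: days
      unit_count: 30
```

With a single yearly index like yours, the Delete-By-Query approach shown earlier is the way to trim individual documents instead of whole indices.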
Tiering like this isn't currently supported, but it's something we've heard requests for. We're gauging how much demand there is for the feature before implementing it.
It adds complexity, and we're not entirely sure it's needed. For example, the space savings going from "realtime" (usually sub-second documents) to, say, monthly is around three orders of magnitude. In comparison, moving from monthly to yearly rollups is a negligible further gain in compression.
For example, on some Metricbeat test data:
Metricbeat raw: 8.6m docs, 1.9GB
Rollup @ minute: 618k docs, 377MB
Rollup @ hourly: 40k docs, 30MB
Just going from raw to hourly rollup was a 98% decrease in size.
All that said, because monthly/yearly rollups are so cheap to store, if you want them we'd just recommend configuring extra jobs that do the larger intervals in parallel to the hourly/daily ones. Then you'll have the coarse-granularity intervals if you want them, and they are so cheap to store that it won't really matter that the data is "duplicated".
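Concretely, that just means defining two (or more) jobs against the same source index pattern, along these lines (again, the names, fields and intervals are placeholders, and older versions use `_xpack/rollup/job` and `interval`):

```
PUT _rollup/job/sensor_hourly_rollup
{
  "index_pattern": "sensor-*",
  "rollup_index": "sensor_rollup_hourly",
  "cron": "0 15 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": { "field": "@timestamp", "fixed_interval": "60m", "delay": "1d" }
  },
  "metrics": [ { "field": "temperature", "metrics": ["avg"] } ]
}

PUT _rollup/job/sensor_daily_rollup
{
  "index_pattern": "sensor-*",
  "rollup_index": "sensor_rollup_daily",
  "cron": "0 45 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": { "field": "@timestamp", "fixed_interval": "24h", "delay": "1d" }
  },
  "metrics": [ { "field": "temperature", "metrics": ["avg"] } ]
}
```

Both jobs read from the same raw indices, so they run independently and neither depends on the other's output.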
Hope that helps! Happy to answer any questions, or take note of more feedback and feature requests. The Rollup API is still very experimental, so we're trying to figure out what users need/want.
I don't know if I made myself clear... I don't want yearly data (meaning all the data in a year aggregated into one record)... but as I said a little earlier:
By yearly data I meant data aggregated by hour for that year, exactly what you described with the Metricbeat example.
Going from realtime to by-minute (for month-old records) and then from by-minute to hourly (for year-old records).