Rollup strategies with automatic old-raw-data deletion


(Paulo Reis) #1

Hi,

We're using ES for IoT. A lot of data is stored in our cluster: something is indexed every 3s, and after a year we end up with tons of data that is no longer that important. We're trying to figure out a way to aggregate the old data at a coarser granularity.

Example: Month-old data should be aggregated by minute (averaging the fields we care about). Year-old data should be aggregated hourly. After the ROLLUP (if I can call it that) happens, the old raw data should be deleted (or DUMPED to a file and then removed from the index).

In our current configuration, we're using a pipeline that creates an index annually. Having studied the ROLLUP API, I don't think we can accomplish:

  1. ROLLUP on previous ROLLUP data (monthly data should be transformed into yearly data; or should we keep the raw data, create a new yearly ROLLUP, and then erase the previous monthly ROLLUP?)
  2. Automatic index deletion (for old raw data)

Is there any way to do something similar to this?

Thanks in advance.


(Damien Alexandre) #2

I tried the new Roll Up feature, and one thing that struck me is that the data you rolled up is not deleted, and there is no option to do it.

So I guess the idea is to run your own DELETE or Delete By Query... But you have to make sure the Rollup job has finished before doing that, otherwise you lose the benefits of the feature.
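
For illustration, here is a minimal sketch of that cleanup step in Python. It only builds the request body for a Delete By Query and sends nothing. The field name `timestamp`, the one-hour safety margin, and the idea that you already know the point up to which the job has rolled data up are all assumptions, not something the Rollup API provides in this form.

```python
from datetime import datetime, timedelta, timezone


def build_delete_body(timestamp_field, rolled_up_until,
                      safety_margin=timedelta(hours=1)):
    """Build a Delete By Query body for raw docs older than the rolled-up point.

    `rolled_up_until` is assumed to be tracked by you (hypothetical input);
    the safety margin keeps a little raw data in case the job lags behind.
    """
    cutoff = rolled_up_until - safety_margin
    return {
        "query": {
            "range": {
                # Only documents strictly older than the cutoff are matched.
                timestamp_field: {"lt": cutoff.isoformat()}
            }
        }
    }


# Hypothetical example: everything before 2018-06-01 UTC has been rolled up.
rolled_up_until = datetime(2018, 6, 1, tzinfo=timezone.utc)
body = build_delete_body("timestamp", rolled_up_until)
```

The resulting `body` would be sent as a `POST` to `<raw-index>/_delete_by_query`, and only once you have confirmed the job has actually processed data past the cutoff.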


(Zachary Tong) #3

This is correct, the Rollup API explicitly doesn't touch the original source data. We did it this way for a few reasons. Philosophically, we wanted Rollup just to add functionality and not deal with the intricacies of data retention. Technically, it simplified the feature a lot.

For now, users will need to use something like Curator or Delete/Delete-By-Query to manage data retention of the source data (if they wish). In the future this will be handled more simply with Index Lifecycle Management.

Tiering like this isn't currently supported, but it's something we've heard requests for. We're gauging how desired the feature is before implementing.

It adds complexity, and we're not entirely sure it's needed. E.g. the space savings going from "realtime" (sub-second documents usually) to say monthly is like three orders of magnitude. In comparison, moving from monthly to yearly rollups is a negligible change in compression.

E.g. on some metricbeat test data:

  • Metricbeat raw: 8.6m docs, 1.9GB
  • Rollup @ minute: 618k docs, 377MB
  • Rollup @ hourly: 40k docs, 30MB

Just going from raw to hourly rollup was a 98% decrease in size.
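
The arithmetic behind that figure can be checked with a few lines of Python. The only assumption here is treating 1.9GB as 1900MB; the other numbers are taken from the list above.

```python
def percent_decrease(before, after):
    """Return how much smaller `after` is than `before`, in percent."""
    return (1 - after / before) * 100


# Sizes in MB (1.9GB taken as 1900MB) and document counts from the test data.
raw_mb, minute_mb, hourly_mb = 1900, 377, 30
raw_docs, minute_docs, hourly_docs = 8_600_000, 618_000, 40_000

size_raw_to_minute = percent_decrease(raw_mb, minute_mb)      # about 80.2
size_raw_to_hourly = percent_decrease(raw_mb, hourly_mb)      # about 98.4
docs_raw_to_hourly = percent_decrease(raw_docs, hourly_docs)  # about 99.5

print(f"raw -> minute size: {size_raw_to_minute:.1f}% smaller")
print(f"raw -> hourly size: {size_raw_to_hourly:.1f}% smaller")
print(f"raw -> hourly docs: {docs_raw_to_hourly:.1f}% fewer")
```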

All that said, because monthly/yearly rollups are so cheap to store, if you want it, we'd just recommend configuring extra jobs that do the larger intervals in parallel to hourly/daily. Then you'll have the large granularity intervals if you want them, and they are so cheap to store it won't really matter that the data is "duplicated".
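
As a rough sketch of what "extra jobs in parallel" could look like, the Python below builds two job bodies that read the same raw indices and differ only in interval and target index. The index names (`iot-*`, `rollup-iot-minute`, `rollup-iot-hourly`), the field names, and the cron schedule are made up for illustration, and the key names (`interval` in particular) follow the job body as it looked in the early Rollup API, so check them against the docs for your version.

```python
def rollup_job_config(index_pattern, rollup_index, interval,
                      timestamp_field, avg_fields, cron="0 0 * * * ?"):
    """Build the body for one rollup job (a sketch, not validated by ES)."""
    return {
        "index_pattern": index_pattern,
        # The rollup index must not itself match `index_pattern`.
        "rollup_index": rollup_index,
        "cron": cron,  # assumed schedule: at the top of every hour
        "page_size": 1000,
        "groups": {
            "date_histogram": {"field": timestamp_field, "interval": interval}
        },
        # One entry per field, storing only the average.
        "metrics": [{"field": f, "metrics": ["avg"]} for f in avg_fields],
    }


# Hypothetical field names for an IoT payload.
fields = ["temperature", "humidity"]

# Two independent jobs over the same raw data, one per granularity.
jobs = {
    "iot_minute": rollup_job_config("iot-*", "rollup-iot-minute", "1m",
                                    "timestamp", fields),
    "iot_hourly": rollup_job_config("iot-*", "rollup-iot-hourly", "1h",
                                    "timestamp", fields),
}
```

Each body would be `PUT` to the job-creation endpoint under its own job id. Because both jobs read the raw data directly, neither depends on the other having run first.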

Hope that helps! Happy to answer any questions, or take note of more feedback and feature requests. The Rollup API is still very experimental, so we're trying to figure out what users need/want. :)


(Paulo Reis) #4

Understood. It will be nice to see more improvements in the ROLLUP API. Thanks for the explanation.

Best Regards,

Paulo.


(Paulo Reis) #5

I don't know if you understood me correctly... I don't want yearly data (meaning all the data in that year aggregated into one record)... but as I said a little earlier:

By yearly data I meant data aggregated by hour for that year. Exactly what you described using Metricbeat.

Going from realtime to minute (for one month old records) and then from minute to hourly (for one year old records).
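
To make the tiering concrete, here is a small Python sketch of the rule as described: raw for recent data, minute buckets once a record is a month old, hourly buckets once it is a year old. The 30-day and 365-day thresholds are an approximation of "a month" and "a year", and this only illustrates the bucketing logic; it is not something the Rollup API does for you.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def granularity_for(ts, now):
    """Return the bucket width for a record, or None to keep it raw."""
    age = now - ts
    if age >= timedelta(days=365):  # roughly a year old: hourly
        return timedelta(hours=1)
    if age >= timedelta(days=30):   # roughly a month old: by minute
        return timedelta(minutes=1)
    return None


def bucket_start(ts, now):
    """Floor a timestamp to the start of its bucket for its age tier."""
    step = granularity_for(ts, now)
    if step is None:
        return ts
    return ts - ((ts - EPOCH) % step)


now = datetime(2019, 1, 1, tzinfo=timezone.utc)
recent = bucket_start(datetime(2018, 12, 25, 10, 20, 33, tzinfo=timezone.utc), now)
month_old = bucket_start(datetime(2018, 11, 15, 10, 20, 33, tzinfo=timezone.utc), now)
year_old = bucket_start(datetime(2017, 6, 1, 10, 20, 33, tzinfo=timezone.utc), now)
```

Records that land in the same bucket would then be averaged field by field, which is the part a rollup job's `avg` metric takes care of.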

