Questions Regarding Deleting Source Data After Transform

From what I read in other posts, if we are running a continuous transform, we should not delete any data from the source index, as that can lead to wrong results: when the transform re-computes, the deleted records will no longer be taken into account in the summarized index.

I have a few questions on that topic:

  1. Does this apply to both pivot and latest transforms?
  2. If we are only interested in the summarized view and not the raw records in the source index, is there anything else we can do to reduce the storage cost of the source index if we cannot delete from it? Writes to the source index can be very heavy.

Thanks.

This applies to pivot. However, if you use a date_histogram in the pivot, you don't need to keep data from old buckets.

Let me give you an example: assume you pivot on user data, e.g. user_id, and you calculate values based on the data you have for that user. If new data comes in for this user_id, the values for that user will be recalculated. If you have meanwhile deleted data, the recalculated values might be wrong.

However, in some cases this might be what you want, e.g. if you want to transform data to get a view of the last X days.

In other words: if you care about a global view, without aging data out, don't delete. If not, it might be OK to delete. Another example: if instead of user_id you pivot on order_id, and an order does not change after a certain amount of time, you can delete the source documents for it, because the data for that order won't be recalculated anymore.

Deleting source data does not trigger a recalculation; only adding new data for a value you pivot on does, and it is that recalculation which requires the older data.

Coming back to the date_histogram: if you group daily and delete data that is older than a day, this isn't a problem, because the transform will not recalculate data for that bucket. This is similar to the order_id example: if there is no trigger for recalculation, you can safely delete the source data.
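To make the daily grouping concrete, a continuous pivot like the following sketch (index and field names are hypothetical) only updates a day's bucket while new documents still arrive for it, so source documents from buckets that are already closed can be deleted without changing the summary:

```
PUT _transform/daily_per_user
{
  "source": { "index": "events" },
  "dest": { "index": "events_daily" },
  "pivot": {
    "group_by": {
      "user_id": { "terms": { "field": "user_id" } },
      "day": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } }
    },
    "aggregations": {
      "event_count": { "value_count": { "field": "event_id" } }
    }
  },
  "sync": { "time": { "field": "timestamp", "delay": "120s" } },
  "frequency": "5m"
}
```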

I suggest having a look at rollup, too. Transform works here, but rollup is specifically intended for the data compaction/reduction use case.
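For reference, a rollup job for the compaction case might look roughly like this (index names, fields, and the schedule are hypothetical); note that rollup only supports a fixed set of metrics such as min, max, sum, avg, and value_count:

```
PUT _rollup/job/events_hourly
{
  "index_pattern": "events-*",
  "rollup_index": "events_rollup",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": { "field": "timestamp", "fixed_interval": "1h" },
    "terms": { "fields": [ "user_id" ] }
  },
  "metrics": [
    { "field": "value", "metrics": [ "min", "max", "sum", "avg" ] }
  ]
}
```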

Hi Hendrik. Thanks for the context. In some situations we do want a global view of the pivot, like the user_id example, so deleting the source data won't work for us in those cases. I have looked into rollup as well, but it doesn't support some of the aggregations we need, like cardinality.

It seems we have to stick with keeping old data for those use cases (global view/cardinality), but we will consider rollup or date_histogram in other places.

To reduce the amount of data you have to keep, you can consider stacking transforms. E.g. to compact the data, you could build daily, weekly, or monthly summaries. On top of the output of such a transform, you could run another transform that creates the final pivot you need.

In other words: the output of a transform is just an index, so you can run another transform on that index. For cardinality this should work; for things like avg there is a weighted_avg aggregation to build averages of averages correctly.
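As a sketch of such a second transform, assuming the first (daily) transform wrote hypothetical fields daily_avg (an avg) and event_count (a value_count) into events_daily:

```
PUT _transform/user_overall
{
  "source": { "index": "events_daily" },
  "dest": { "index": "events_overall" },
  "pivot": {
    "group_by": {
      "user_id": { "terms": { "field": "user_id" } }
    },
    "aggregations": {
      "overall_avg": {
        "weighted_avg": {
          "value": { "field": "daily_avg" },
          "weight": { "field": "event_count" }
        }
      },
      "total_events": { "sum": { "field": "event_count" } }
    }
  }
}
```

Weighting each daily average by its document count is what makes the average of averages come out the same as an average over the raw data.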

Thanks. The transform-over-transform approach seems like an interesting solution.

To use the cardinality case as an example: the raw table contains a record of each item purchase (user_id, item, timestamp).

We first run a continuous transform to convert the raw data into a summary of "the set of items each user has bought each day" - (user_id, date, item_sets)

We can then run another continuous transform on top of that to get "the set of items each user has bought since the beginning of time" - (user_id, item_sets)

This way, we can safely delete old data in the raw table since we won't have new records getting inserted with an old timestamp, so we won't trigger a recompute.

Is this the correct understanding?
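For illustration, the first transform in that chain might look roughly like this (index and field names are hypothetical; here a cardinality aggregation records the number of distinct items per day, while capturing the actual item set would need something like a terms or scripted_metric aggregation):

```
PUT _transform/items_per_user_daily
{
  "source": { "index": "purchases" },
  "dest": { "index": "purchases_daily" },
  "pivot": {
    "group_by": {
      "user_id": { "terms": { "field": "user_id" } },
      "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d" } }
    },
    "aggregations": {
      "distinct_items": { "cardinality": { "field": "item" } }
    }
  },
  "sync": { "time": { "field": "timestamp", "delay": "60s" } },
  "frequency": "5m"
}
```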

Yes, that's right.

Note that for the 2nd continuous transform you need a timestamp field to sync on. You can use the timestamp from the daily bucket; however, in that case the delay for the 2nd transform needs to be 24h plus the delay of the 1st transform. If that's too slow for you, you can either configure the 1st transform with a shorter bucket interval or add an ingest timestamp using an ingest pipeline.
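A sketch of how the 2nd transform's sync could be wired, assuming the 1st transform writes its daily bucket key into a date field in purchases_daily (the aggregation here is just a placeholder, and the 25h delay is illustrative for a 1d bucket plus the 1st transform's delay):

```
PUT _transform/items_per_user_total
{
  "source": { "index": "purchases_daily" },
  "dest": { "index": "purchases_total" },
  "pivot": {
    "group_by": {
      "user_id": { "terms": { "field": "user_id" } }
    },
    "aggregations": {
      "days_with_purchases": { "value_count": { "field": "date" } }
    }
  },
  "sync": { "time": { "field": "date", "delay": "25h" } },
  "frequency": "1h"
}
```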

See: Dec 12th, 2018: [EN][Elasticsearch] Automatically adding a timestamp to documents
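As a rough sketch of the ingest timestamp approach (pipeline and field names are hypothetical), a set processor can stamp each summary document with its ingest time:

```
PUT _ingest/pipeline/add_ingest_timestamp
{
  "description": "Add the time the document was ingested",
  "processors": [
    { "set": { "field": "ingest_timestamp", "value": "{{_ingest.timestamp}}" } }
  ]
}
```

The 1st transform would then reference this pipeline in its dest settings (e.g. "pipeline": "add_ingest_timestamp"), and the 2nd transform can sync on ingest_timestamp with a much smaller delay.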
