Transform behavior with deleted documents

Consider a source index that contains user authentications. I want to create a transform that contains, for each user, the oldest and newest login. It would look like this (a rough config sketch follows the list):

  • Group by user
  • Aggregate max(@timestamp) and min(@timestamp)
  • Run continuously
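
To make that concrete, here is a rough sketch of such a transform in Kibana Dev Tools syntax (// comments are Console-only). The index names auth-logs and user-logins, the field user.name, and the sync settings are placeholders for illustration, not my real setup:

  PUT _transform/user-logins
  {
    "source": { "index": "auth-logs" },
    "dest":   { "index": "user-logins" },
    "frequency": "1m",
    "sync": {
      // continuous mode: pick up new documents based on their ingest timestamp
      "time": { "field": "@timestamp", "delay": "60s" }
    },
    "pivot": {
      "group_by": {
        "user.name": { "terms": { "field": "user.name" } }
      },
      "aggregations": {
        "first_login": { "min": { "field": "@timestamp" } },
        "last_login":  { "max": { "field": "@timestamp" } }
      }
    }
  }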

So what happens to the min(@timestamp) of a user if index lifecycle management deletes old authentications of that user?

On the next checkpoint, will the transform

  1. keep the old min(@timestamp) even though the corresponding document was deleted?
  2. set min(@timestamp) to the oldest authentication that is currently present in the dataset?

I guess what I'm asking is: does the transform keep persistent aggregation state over its lifetime, or is it re-computed completely on every checkpoint?

Transform re-computes, so think of a transform as a "materialized view" on the source data.

However, a continuous transform does not re-compute everything with every run; it minimizes the update. It works in two steps: 1) identify changes, 2) apply changes. That means that if user a changes something between checkpoint 323 and checkpoint 324 but user b doesn't, only the data for user a gets updated.
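
To illustrate the second step with the login example (this is only a conceptual picture, not the literal internal implementation): applying changes is roughly equivalent to re-running the pivot aggregation over the source index, restricted to the entities that changed since the last checkpoint. Index and field names are taken from the hypothetical config above:

  POST auth-logs/_search
  {
    "size": 0,
    "query": {
      // only the entities that changed between the two checkpoints
      "terms": { "user.name": [ "a" ] }
    },
    "aggs": {
      "pivot": {
        "composite": {
          "sources": [
            { "user.name": { "terms": { "field": "user.name" } } }
          ]
        },
        "aggs": {
          "first_login": { "min": { "field": "@timestamp" } },
          "last_login":  { "max": { "field": "@timestamp" } }
        }
      }
    }
  }

The resulting buckets then overwrite the corresponding documents in the destination index, which is why a deleted source document simply drops out of the next re-computation of that entity.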

So the answer to your question is 2 if the document is re-computed; if not, the document stays unchanged.

We are aware of use cases where people would prefer to update instead of re-compute, for the reasons you wrote. We are looking into those and might support updates in the future.

Hi Hendrik,

thanks a lot for the explanation. So to clarify, as this has performance implications that I was not aware of:

Consider my source dataset with user logins again. Let's say one of my users, bob, is particularly active and has millions of logins recorded. If bob logs in again, will the transform run the full aggregation again, with a composite terms query filtered to user.name=bob, over the whole dataset?

This would mean that the performance impact of transforms scales both with the number of changes per checkpoint and with the number of old documents for each changed entity. I originally thought that once a transform had processed a document, it wouldn't have to touch it again and would instead just combine aggregated data.

I'm looking forward to news about the "update instead of re-compute" functionality you mentioned, because I think it could greatly increase performance in use cases like mine. Maybe it could work by keeping the output of previous checkpoints for each entity and feeding it back into the combine or reduce phase? Anyway, I'm looking forward to what you come up with.

Hi nemhods,

yes, your summary is correct.

To explain the challenge a bit: Transform is a generic tool and supports a lot of different aggregations. To illustrate the "update instead of re-compute" problem:

  • min/max/sum are easy
  • for average we could store sum and count to make it update-able (a sketch of that idea follows below)
  • to update a median you need a histogram; fortunately we now have that (the histogram data type)
  • for cardinality we would have to store the sketch, e.g. the HyperLogLog data structure, and we do not have such a data type yet

This doesn't mean we do not want to support updates at all. But we likely will not make every aggregation/data type update-able, or will at least add support step by step over time.
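
To illustrate the average case from the list above: you can already store the building blocks explicitly today. A hypothetical pivot like the one below (session.duration is a made-up numeric field, not something from your example) keeps sum and value_count in the destination index and derives the average with a bucket_script, so the stored pieces would in principle be mergeable under an update model:

  "pivot": {
    "group_by": {
      "user.name": { "terms": { "field": "user.name" } }
    },
    "aggregations": {
      "duration_sum":   { "sum":         { "field": "session.duration" } },
      "duration_count": { "value_count": { "field": "session.duration" } },
      "duration_avg": {
        // derived per entity from the two stored values
        "bucket_script": {
          "buckets_path": { "s": "duration_sum", "c": "duration_count" },
          "script": "params.s / params.c"
        }
      }
    }
  }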

Awesome, thanks for the clarification. I understand the difficulty. The transform that sparked this question even uses Scripted Metric Aggregations - I imagine this would be especially hard, if not impossible, to move from re-compute to update.
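
For context, here is a simplified, made-up sketch in the spirit of what such an aggregation in my pivot looks like (an "earliest login in epoch millis" metric; my real scripts are more involved). The free-form combine and reduce scripts are exactly the part I can see being hard to turn into an incremental update:

  "aggregations": {
    "first_login_millis": {
      "scripted_metric": {
        "init_script":    "state.min = Long.MAX_VALUE",
        "map_script":     "long t = doc['@timestamp'].value.toInstant().toEpochMilli(); if (t < state.min) { state.min = t }",
        "combine_script": "return state.min",
        "reduce_script":  "long min = Long.MAX_VALUE; for (s in states) { if (s != null && s < min) { min = s } } return min"
      }
    }
  }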
