Consider a source index that contains user authentications. I want to create a transform that contains, for each user, the oldest and newest login. It would look like this:
Group by user
Aggregate max(@timestamp) and min(@timestamp)
Run continuously
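For reference, the transform definition I have in mind looks roughly like this (index, destination, and field names are just placeholders; I'm assuming each login document carries user.name and @timestamp):

```
PUT _transform/user-login-range
{
  "source": { "index": "authentications" },
  "dest": { "index": "user-login-range" },
  // continuous mode: pick up new documents based on the event timestamp
  "frequency": "5m",
  "sync": { "time": { "field": "@timestamp", "delay": "60s" } },
  "pivot": {
    "group_by": {
      "user.name": { "terms": { "field": "user.name" } }
    },
    "aggregations": {
      "first_login": { "min": { "field": "@timestamp" } },
      "last_login": { "max": { "field": "@timestamp" } }
    }
  }
}
```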
So what happens to the min(@timestamp) of a user if index lifecycle management deletes that user's old authentications?
On the next checkpoint, will the transform
1. keep the old min(@timestamp) even though the corresponding document was deleted?
2. set min(@timestamp) to the oldest authentication that is currently present in the dataset?
I guess what I'm asking is: does the transform keep persistent aggregation state over its lifetime, or is it re-computed completely on every checkpoint?
Transform re-computes, so think of a transform as a "materialized view" on the source data.
However, a continuous transform does not re-compute everything with every run; it minimizes the update. It works in 2 steps: 1) identify changes, 2) apply changes. That means if user a changes something between checkpoint 323 and checkpoint 324 but user b doesn't, only the data for user a gets updated.
So the answer to your question is 2 if the document gets re-computed; if not, the document stays unchanged.
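Conceptually the two steps look roughly like the searches below. This is heavily simplified (it reuses your index and field names and uses made-up example timestamps as checkpoint bounds); in reality the transform pages through the data with composite aggregations and keeps its own checkpoint bookkeeping.

```
// step 1: identify which entities changed between the example checkpoint bounds
POST authentications/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "2020-10-01T10:00:00Z", "lt": "2020-10-01T10:05:00Z" } }
  },
  "aggs": {
    "changed_users": { "terms": { "field": "user.name" } }
  }
}

// step 2: apply changes - re-run the aggregations, restricted to the changed entities
POST authentications/_search
{
  "size": 0,
  "query": { "terms": { "user.name": ["a"] } },
  "aggs": {
    "first_login": { "min": { "field": "@timestamp" } },
    "last_login": { "max": { "field": "@timestamp" } }
  }
}
```

Step 2 only reads the documents that still exist, which is why a re-computed min(@timestamp) moves forward once ILM has deleted the oldest ones.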
We are aware of use cases where people prefer to update instead of re-compute, for the reasons you describe. We are looking into those and might support update in the future.
Thanks a lot for the explanation. So to clarify, as this has performance implications that I was not aware of:
Consider my source dataset with user logins again. Let's say one of my users, bob, is particularly active and has millions of logins recorded. If bob logs in again, will the transform run the full aggregation, with a composite terms query on "user.name=bob", over the whole dataset again?
This would mean that the performance impact of transforms scales with both the number of changes per checkpoint and the number of old documents for each changed entity. I originally thought that once a transform had processed a document, it wouldn't have to touch it again and would instead just combine the aggregated data.
I'm looking forward to news about the "update instead of re-compute" functionality you mentioned, because I think it could greatly increase performance in use cases like mine. Maybe it could work by keeping the output of previous checkpoints for each entity and feeding it back into the combine or reduce phase? Anyway, looking forward to what you come up with.
To explain the challenge a bit: Transform is a generic tool and supports a lot of different aggregations. To illustrate the "update instead of re-compute" problem:
min/max/sum are easy
for average we could store sum and count to make it update-able (see the worked example after this list)
to update a median you need a histogram; fortunately we have that now (the histogram data type)
for cardinality we have to store the sketch, e.g. the HyperLogLog data structure; we do not have such a data type yet
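To make the average case concrete (illustrative numbers only): if the stored state for an entity were sum=100 and count=4 (average 25), and the new checkpoint contributed sum=20 and count=1, the updated value would simply be avg = (100 + 20) / (4 + 1) = 24, without re-reading any of the old documents; only sum and count need to be carried forward per entity.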
This doesn't mean we do not want to support update at all. But we will likely not make every aggregation/data type update-able, or at least we will add support step by step over time.
Awesome, thanks for the clarification. I understand the difficulty. The transform that sparked this question even uses Scripted Metric Aggregations - I imagine those would be especially hard, if not impossible, to move to update instead of re-compute.
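To illustrate what I mean, a scripted metric aggregation inside a pivot looks something like the sketch below (a made-up example that counts failed logins; event.outcome is just an illustrative field, not necessarily what my transform uses). The state is whatever the scripts build up, so I don't see anything generic the transform could merge with new documents; it would have to re-run the map script over everything.

```
"aggregations": {
  "failed_logins": {
    "scripted_metric": {
      // per-shard state starts as an empty counter
      "init_script": "state.count = 0",
      // runs once per matching document
      "map_script": "if (doc['event.outcome'].value == 'failure') { state.count += 1 }",
      // returns this shard's partial result
      "combine_script": "return state.count",
      // merges the per-shard results into the final value
      "reduce_script": "def total = 0; for (c in states) { total += c } return total"
    }
  }
}
```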