Questions related to transforms limitations

Hello,

I am experimenting with Transforms and would like to confirm that the limitations below are real:

  1. Detecting document removal in the source and reflecting it on the next sync. If a doc is removed at the source, its value remains in the destination; however, if you later add a doc to the source index with the same aggregation keys, the destination is updated accordingly.
  2. Dynamic destination index creation based on field values, as Logstash does. I assume one needs to create several transforms with filters to mimic this behavior.
  3. An option to associate a pipeline when creating the transform job (I believe I saw a field to define it, but I may be wrong).

I would also like to mention that it took me several attempts to understand what Frequency and Delay really do, because of the statement quoted below. What is the point of a frequency ranging from 1s to 1h if the delay parameter is the one that truly determines the changes in the destination index? I ask because it seems to make more sense for Frequency and Delay to have the same value, unless one wants to spread them apart to save resources. It is also not clear which part costs more: the checks or the indexing.

Frequency

"In a continuous transform, the frequency configuration option sets the interval between checks for changes in the source indices. If changes are detected, then the source data is searched and the changes are applied to the destination index. Depending on your use case, you may wish to reduce the frequency at which changes are applied. By setting frequency to a higher value (maximum is one hour), the workload can be spread over time at the cost of less up-to-date data."

Thank you

Correct, deletes are invisible when detecting changes.

Usually a transform has one destination index. However, by writing through an ingest pipeline you can, similar to Logstash, change the destination index by rewriting _index. If you do this, note that the transform can't manage those indices; you have to manage them yourself.

Yes, see the pipeline setting under dest (dest.pipeline).
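
To make the two answers above more concrete, here is a minimal sketch in Python that only builds the JSON bodies, without sending anything to a cluster. It assumes the standard `set` ingest processor and the `dest.pipeline` field of the transform configuration. The index names, the pipeline id, and the `region` and `amount` fields are hypothetical and chosen purely for illustration.

```python
import json

# Hypothetical names, for illustration only.
PIPELINE_ID = "route-by-region"
FALLBACK_INDEX = "sales-summary"


def build_routing_pipeline(field: str, prefix: str) -> dict:
    """Ingest pipeline body whose `set` processor overwrites `_index`."""
    return {
        "description": f"Route each transform document by its '{field}' value",
        "processors": [
            {
                "set": {
                    "field": "_index",
                    # Mustache template, resolved per document at ingest time.
                    "value": prefix + "{{{" + field + "}}}",
                }
            }
        ],
    }


def build_transform(source_index: str, dest_index: str,
                    pipeline_id: str, group_field: str) -> dict:
    """Transform body that sends its output through an ingest pipeline."""
    return {
        "source": {"index": source_index},
        # dest.index is still required; the pipeline may redirect documents.
        "dest": {"index": dest_index, "pipeline": pipeline_id},
        "pivot": {
            "group_by": {group_field: {"terms": {"field": group_field}}},
            "aggregations": {"total": {"sum": {"field": "amount"}}},
        },
    }


pipeline = build_routing_pipeline("region", "sales-summary-")
transform = build_transform("sales", FALLBACK_INDEX, PIPELINE_ID, "region")

print(json.dumps(pipeline, indent=2))
print(json.dumps(transform, indent=2))
```

The indices produced by the rewritten `_index` are not managed by the transform, so mappings, lifecycle, and cleanup for them have to be handled separately.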

As for frequency vs. delay:

Frequency controls how often the transform runs, that is, how often it checks for changes and updates the destination.

Delay defines a time window in which data can arrive late because of the various delays that happen during ingest and indexing. At minimum this is the refresh interval of Lucene indexes (default 1s); other delays include time spent in ingest or Logstash, time spent sending data from the origin, etc.

For example, imagine an application that sends telemetry data. For performance reasons the daemon flushes data every 10s; afterwards the data gets queued (2s), processed (2s), and finally indexed (1s). Delay is the sum: (10 + 2 + 2 + 1)s = 15s. If for some reason these 15s are exceeded, the transform might not see the update.
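
The arithmetic in this example can be written down as a small Python sketch. The stage names and the `visible_in_time` helper are made up for illustration; only the numbers come from the example above.

```python
# Per-stage latencies from the telemetry example, in seconds.
STAGES_S = {
    "daemon_flush": 10,
    "queue": 2,
    "processing": 2,
    "index_refresh": 1,
}


def required_delay_seconds(stages: dict) -> int:
    """Delay must cover the sum of all stages between event and searchability."""
    return sum(stages.values())


def visible_in_time(actual_latency_s: float, delay_s: float) -> bool:
    """True if a document became searchable within the configured delay."""
    return actual_latency_s <= delay_s


delay_s = required_delay_seconds(STAGES_S)
print(f"delay should be at least {delay_s}s")

# A document that took 20s end to end exceeds the window and may be missed.
print(visible_in_time(20, delay_s))
```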

To rule out potential problems, instead of using an externally created timestamp you can use an ingest timestamp that you set as the last step in an ingest pipeline. In the example above, this would allow you to set delay to e.g. 2s, to account only for the index refresh interval.
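
A rough sketch of that setup, again as Python that only builds the request bodies. It assumes the `set` processor with the `_ingest.timestamp` metadata value and the `sync.time` section of a continuous transform. The field name `event.ingested` is an assumption picked for illustration. Note that this pipeline runs when data is written to the source index; it is not the transform's `dest.pipeline`.

```python
import json

# Assumed field name for the ingest timestamp; any date field would do.
INGEST_TS_FIELD = "event.ingested"

# Pipeline applied when documents are indexed into the SOURCE index.
# The timestamp is set as the last processor so it reflects the latest
# possible moment before the document is indexed.
source_pipeline = {
    "description": "Stamp each document with its ingest time",
    "processors": [
        # ... any other processors would come first ...
        {
            "set": {
                "field": INGEST_TS_FIELD,
                "value": "{{{_ingest.timestamp}}}",
            }
        }
    ],
}

# Continuous-mode section of the transform: sync on the ingest timestamp,
# so delay only has to cover the index refresh interval.
transform_sync = {
    "sync": {
        "time": {
            "field": INGEST_TS_FIELD,
            "delay": "2s",
        }
    }
}

print(json.dumps(source_pipeline, indent=2))
print(json.dumps(transform_sync, indent=2))
```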

Regarding resource usage: delay makes no difference to performance, because it is just an offset. Frequency, however, controls how often the transform checks for updates, so more recency costs more performance.
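
As a back-of-the-envelope illustration of this trade-off, the sketch below uses a deliberately simplified model: it assumes one check per frequency interval and ignores the time spent searching and indexing. The function names are hypothetical. Under that model, frequency drives the number of checks, while delay only shifts how old the newest visible data is.

```python
def checks_per_hour(frequency_s: int) -> float:
    """Number of change checks per hour for a given frequency."""
    return 3600 / frequency_s


def worst_case_staleness_s(frequency_s: int, delay_s: int) -> int:
    """Simplified upper bound on how long an event waits before the
    destination reflects it: up to one full interval plus the delay offset."""
    return frequency_s + delay_s


DELAY_S = 15  # from the telemetry example

for frequency_s in (1, 60, 3600):  # 1s, 1m, 1h
    print(
        f"frequency={frequency_s:>4}s  "
        f"checks/hour={checks_per_hour(frequency_s):>6.0f}  "
        f"worst-case staleness={worst_case_staleness_s(frequency_s, DELAY_S)}s"
    )
```

Changing `DELAY_S` leaves the number of checks untouched, which is why setting frequency and delay to the same value is not required.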

I agree that Frequency and Delay are quite complex; we are working on simplifications for the future.
