Best practices for continuous transform with fixed gte filter and tiered storage (hot/warm indices)

I'm using an Elasticsearch continuous transform to aggregate data from time-based indices (daily indices containing time-sensitive documents). My goal is for the transform to only process data starting from a specific date (2025-04-01). To do this, I added a gte filter on the @timestamp field in the transform's source query:

"range": {
  "@timestamp": {
    "gte": "2025-04-01T00:00:00Z"
  }
}

I understand that this filter applies only to the initial run, and that from then on the transform uses checkpointing to remember the last synced @timestamp. So in theory, the transform should only process new data since the last checkpoint, rather than re-scanning all documents from the original gte date.

However, I am not 100% sure whether the gte filter could cause the transform to repeatedly access older indices, including those moved to the warm tier (we move data older than 30 days into warm, so daily indices more than 30 days old end up there). I would like to know whether the fixed gte might cause the transform to keep scanning data from that date forward, which could lead to performance issues once older indices sit in the warm tier.

Based on my reading of how sync.time.field and checkpointing work, I believe this concern doesn't apply: the transform should not re-scan data from before the last checkpoint and thus won't query the old warm indices unnecessarily.
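
For reference, here is a minimal sketch of my transform configuration (the transform name, index patterns, field names, and intervals are placeholders, not my real values):

PUT _transform/daily-agg
{
  "source": {
    "index": ["metrics-daily-*"],
    "query": {
      "bool": {
        "filter": [
          { "range": { "@timestamp": { "gte": "2025-04-01T00:00:00Z" } } }
        ]
      }
    }
  },
  "dest": { "index": "metrics-daily-agg" },
  "frequency": "5m",
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "host": { "terms": { "field": "host.name" } }
    },
    "aggregations": {
      "event_count": { "value_count": { "field": "@timestamp" } }
    }
  }
}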

Could anyone from the community or the Elastic team confirm how the transform will behave?

Also, are there best practices around using gte with continuous transforms and tiered storage?

Thanks in advance!

By default, the Transform will process any data that matches the source query for each checkpoint, and in some cases that means all of it. In the scenario you describe, it may query old warm indices unnecessarily.

A Transform running continuously performs up to three searches each time it checks for changes:

  1. The first search determines whether the checkpoint runs. It queries the index for any changes since the previous checkpoint. If there are any hits, the checkpoint begins; if there are none, the Transform skips the checkpoint and checks again after the frequency interval elapses.
  2. If the checkpoint begins, the second search determines what has changed. It grabs the documents for that checkpoint (any documents ingested since the last synced @timestamp) and maps them onto the declared entities, where an entity is a value produced by the group_by configuration. These are the entities that have changed since the last checkpoint and that the Transform needs to update.
  3. The third search recomputes the values for the changed entities. It grabs all documents related to each changed entity, going as far back as the source query allows, which in this case is gte 2025-04-01T00:00:00Z. The Transform uses these hits to recompute each entity, and the new value is saved to the destination index.
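
To make that third search concrete, it is conceptually similar to the request below, where host-a and host-b stand in for the entities that changed during the checkpoint. This is only an illustration of the query's shape (the Transform builds the real request internally, and the index and field names here are the placeholders from the sketch above); the key point is that the range starts at the fixed gte, so the search spans every backing index back to 2025-04-01, warm ones included.

GET metrics-daily-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "range": { "@timestamp": { "gte": "2025-04-01T00:00:00Z" } } },
        { "terms": { "host.name": ["host-a", "host-b"] } }
      ]
    }
  },
  "aggs": {
    "event_count": { "value_count": { "field": "@timestamp" } }
  }
}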

There are configurations that limit the range for that third search.

For example, if the group_by includes a date_histogram set to 1h, then the Transform will only query for the documents in that given hour. The resulting entity represents that hour, and the destination index will hold one such entity per hour.
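
A rough sketch of such a pivot (field names are placeholders again):

"pivot": {
  "group_by": {
    "hour": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" }
    },
    "host": { "terms": { "field": "host.name" } }
  },
  "aggregations": {
    "event_count": { "value_count": { "field": "@timestamp" } }
  }
}

With this group_by, a changed entity is tied to a specific hour bucket, so the recompute search only needs the documents from that hour rather than everything since 2025-04-01.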

Another example is to use a relative range in the query. Instead of gte 2025-04-01T00:00:00Z, the Transform query could use gte now-1M/M. This is more like a sliding window, where every time the Transform runs a checkpoint, the resulting entity represents the previous month of data. This will limit the amount of time included in the third search, but it also means older data gets replaced or excluded from the results.
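
A sketch of that source query (same placeholder index pattern):

"source": {
  "index": ["metrics-daily-*"],
  "query": {
    "range": { "@timestamp": { "gte": "now-1M/M" } }
  }
}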