Transforms on data older than a specific timestamp: checkpoints are not created

Continuing the discussion from ES Continuous Transforms Checkpoint not updating:

Hi,
I'm creating a transform that reads an index and aggregates all documents OLDER THAN 6 days (calculating averages etc.).
When I run the transform, I notice that only one checkpoint is created and no more.
I have configured continuous mode with a @timestamp field and a delay of 60 seconds.

However, every 60 seconds a query for data older than 6 days will match new data (documents whose timestamps crossed the 6-day mark during the last 60 seconds), so I expect a new checkpoint to be created for this data.
This does not happen. Why? I see in the transform statistics

      "checkpointing" : {
        "last" : {
          "checkpoint" : 1,
          "timestamp_millis" : 1644238548355,
          "time_upper_bound_millis" : 1644238440000
        },
        "operations_behind" : 12850,
        "changes_last_detected_at" : 1644238548355
      }

that 12850 more documents are processed but apparently not checkpointed.

Is there an explanation for this? Is transform only designed for examining documents NEWER than a specific timestamp?

Continuous mode is configured on a date field, and the values of that field are used for checkpointing. In your example, time_upper_bound_millis = 1644238440000, which translates to 2022-02-07 12:54:00 UTC, is the time upper bound of the checkpoint. If you index data with a timestamp before that time after the checkpoint has been created, the transform is not able to query for this data. That's by design.

Note the difference between timestamp_millis and time_upper_bound_millis. time_upper_bound_millis is calculated by taking the system time and subtracting delay. In this case the value is additionally rounded down to a bucket boundary, because you use a date_histogram in your transform configuration.
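To make the calculation concrete, here is a minimal sketch that reproduces the numbers from the statistics above. It assumes the system time at checkpoint creation equals timestamp_millis and that the date_histogram uses a fixed 1-minute interval; neither is stated in the thread, they are simply consistent with the values shown. The function names are made up for illustration and are not part of Elasticsearch.

```python
from datetime import datetime, timezone


def time_upper_bound_millis(now_millis: int, delay_millis: int, bucket_millis: int) -> int:
    """Subtract the delay from 'now', then round down to a bucket boundary."""
    delayed = now_millis - delay_millis
    return delayed - (delayed % bucket_millis)


def to_utc(millis: int) -> str:
    """Render epoch milliseconds as an ISO 8601 UTC string."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).isoformat()


# Values from the checkpointing stats; delay is the configured 60s.
checkpoint_created = 1644238548355   # timestamp_millis
delay = 60_000                       # 60 seconds
bucket = 60_000                      # ASSUMPTION: 1-minute date_histogram interval

upper_bound = time_upper_bound_millis(checkpoint_created, delay, bucket)
print(upper_bound, to_utc(upper_bound))
```

With these inputs the result matches time_upper_bound_millis from the stats, which is why the upper bound sits a bit more than 60 seconds behind timestamp_millis.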

To work around your problem you have 2 options:

  • increase delay: if you increase delay to e.g. 6d, the transform won't miss any data that arrives with timestamps as old as now - 6d; however, it also won't process any data in between (newer than now - 6d).
  • use an ingest timestamp for sync: if you add another timestamp field to your source data that contains the date when the data was ingested/indexed into Elasticsearch, you can configure sync to use the ingest timestamp while still pivoting on another timestamp field.
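
As a rough sketch of what the two options could look like, the snippet below builds the request body for a transform (as sent to PUT _transform/<transform_id>) with different sync settings. The index names, the field names value and event.ingested, the 1-minute interval and the helper function are assumptions for illustration only; adapt them to your own configuration.

```python
import json


def build_transform_body(sync_field: str, delay: str, pivot_field: str = "@timestamp") -> dict:
    """Build a continuous transform body; only the sync settings differ between options."""
    return {
        "source": {"index": "source-index"},   # hypothetical index name
        "dest": {"index": "dest-index"},       # hypothetical index name
        "frequency": "1m",
        # The sync field and delay determine the checkpoint's time upper bound.
        "sync": {"time": {"field": sync_field, "delay": delay}},
        "pivot": {
            "group_by": {
                "bucket": {
                    "date_histogram": {"field": pivot_field, "fixed_interval": "1m"}
                }
            },
            # "value" is a hypothetical numeric field.
            "aggregations": {"avg_value": {"avg": {"field": "value"}}},
        },
    }


# Option 1: keep syncing on @timestamp, but with a 6-day delay.
option_delay = build_transform_body(sync_field="@timestamp", delay="6d")

# Option 2: sync on an ingest timestamp (assumed here to be stored in
# "event.ingested") with a short delay, while still pivoting on @timestamp.
option_ingest = build_transform_body(sync_field="event.ingested", delay="60s")

print(json.dumps(option_delay["sync"], indent=2))
print(json.dumps(option_ingest["sync"], indent=2))
```

For option 2 the ingest timestamp has to be written at index time, for example by an ingest pipeline with a set processor that copies {{_ingest.timestamp}} into the chosen field.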

Thx, I overlooked that delay setting. This works perfectly, and we don't need to process any data in between, only 'older data'.
