How to force Transform to run periodically?

Hi there!

Is there any way to run a Transform periodically (e.g., every 5 seconds), regardless of whether the sync date field in the source index has been updated?

I'm trying to use Transform to aggregate values in a sliding time window (e.g., now-24h), but new values only arrive in the source index when a new event occurs. As time goes by, the sliding window should drop the oldest values, even when no new documents arrive.

I already tried using only the "frequency" param of Transform (without the "sync" param), but the transform was then created in batch mode, not continuous mode. If I use the "sync" param, the transform waits for a doc with a newer datetime before it executes.

Transform script:

PUT _transform/my_transform
{
  "source": {
    "index": [
      "source_index"
    ],
    "query": {
      "bool": {
        "must": [
          {
            "range": {
              "datetime": {
                "gte": "now-24h"
              }
            }
          }
        ]
      }
    }
  },
  "dest": {
    "index": "dest_index"
  },
  "frequency": "5s",
  "pivot": {
    "group_by": {
      "code": {
        "terms": {
          "field": "code.keyword"
        }
      }
    },
    "aggregations": {
      "agg24h": {
        "sum": {
          "field": "value"
        }
      }
    }
  }
}
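
For comparison, the continuous variant mentioned above differs from this script only by a "sync" block. A minimal Python sketch that adds it to a transform body follows; `make_continuous` is a hypothetical helper, and using `datetime` as the sync field with a `60s` delay is an assumption for illustration:

```python
import copy


def make_continuous(transform_body, time_field="datetime", delay="60s"):
    """Return a copy of a transform body with a time-based "sync" block added.

    time_field and delay are illustrative assumptions, not values from the post.
    """
    body = copy.deepcopy(transform_body)  # leave the caller's dict untouched
    body["sync"] = {"time": {"field": time_field, "delay": delay}}
    return body


# Shortened stand-in for the request body above.
batch_body = {
    "source": {"index": ["source_index"]},
    "dest": {"index": "dest_index"},
    "frequency": "5s",
}

continuous_body = make_continuous(batch_body)
print(continuous_body["sync"])
```

As described in the question, a transform configured this way only executes once a doc with a newer `datetime` arrives, so on its own it does not give a window that ages out old values.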

Any suggestions? Is there a better way to do this?

Regards

The described use case is not possible at the moment.

If I understand correctly, you would like to run the full transform every 5 seconds?

A continuous transform is optimized to update only changed entities. If I understand correctly, you are looking to update the full dataset.

I wonder about your use case. Are you running further analysis on this data? Otherwise you could simply run the aggregation at query time. Why do you need a transform?
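
The query-time option could look roughly like the sketch below. It only builds the search body in Python; the body would be sent to `source_index/_search`. The function name `rainfall_sum_query`, the aggregation names, and the `max_gauges` bucket limit are assumptions for illustration, while the field names come from the transform in the question:

```python
import json


def rainfall_sum_query(window="24h", max_gauges=10000):
    """Build a search body that sums `value` per `code` over the last `window`."""
    return {
        "size": 0,  # return aggregation results only, no hits
        "query": {"range": {"datetime": {"gte": f"now-{window}"}}},
        "aggs": {
            "by_code": {
                # terms returns only the top buckets; size must cover all codes
                "terms": {"field": "code.keyword", "size": max_gauges},
                "aggs": {"accumulated": {"sum": {"field": "value"}}},
            }
        },
    }


print(json.dumps(rainfall_sum_query("24h"), indent=2))
```

Because `now-24h` is resolved when the search runs, the window slides on every request without anything having to re-run in the background.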

Hi Hendrik, thanks for your reply.

Yes, I'm trying to update the full dataset. My dataset is composed of rainfall data coming from rain gauge equipment (thousands of them). It is important for users (and for mathematical models) to know the accumulated precipitation for the past 1, 3, 6, 12, and 24 hours, up to 5 days.

We thought about performing the aggregation at query time, but since the same data will be consumed many times by different users, we started investigating a more efficient method. The other option is to execute an aggregation query in a pipeline or external process (e.g., Java or Python) and write the results into an index.
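
A rough Python sketch of that external-process option, under several assumptions: the window list is a subset of the ones listed above, the names `build_search_body`, `buckets_to_docs`, `acc_<window>` and `computed_at` are made up for illustration, and the actual calls to Elasticsearch (search, then indexing into something like `dest_index`) are left out. One search computes every window per gauge with filter aggregations, and the response is then flattened into one document per gauge:

```python
from datetime import datetime, timezone

# Assumed windows, ordered shortest to longest.
WINDOWS = ["1h", "3h", "6h", "12h", "24h", "5d"]


def build_search_body(windows=WINDOWS, max_gauges=10000):
    """One search that sums `value` per `code` for every window."""
    per_window = {
        f"acc_{w}": {
            "filter": {"range": {"datetime": {"gte": f"now-{w}"}}},
            "aggs": {"total": {"sum": {"field": "value"}}},
        }
        for w in windows
    }
    return {
        "size": 0,
        # Restrict the scan to the longest window (assumed to be the last one).
        "query": {"range": {"datetime": {"gte": f"now-{windows[-1]}"}}},
        "aggs": {
            "by_code": {
                "terms": {"field": "code.keyword", "size": max_gauges},
                "aggs": per_window,
            }
        },
    }


def buckets_to_docs(response, windows=WINDOWS, now=None):
    """Flatten the aggregation response into one document per gauge."""
    now = now or datetime.now(timezone.utc)
    docs = []
    for bucket in response["aggregations"]["by_code"]["buckets"]:
        doc = {"code": bucket["key"], "computed_at": now.isoformat()}
        for w in windows:
            doc[f"acc_{w}"] = bucket[f"acc_{w}"]["total"]["value"]
        docs.append(doc)
    return docs
```

A scheduler (cron, for example) could run this every few seconds and index each document with the gauge code as its ID, so every run overwrites the previous result. Gauges with no data in the longest window produce no bucket, so their old documents would need separate handling.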

Do you have any suggestions? Do you think Transform will incorporate this functionality in a future version?

This sounds like https://github.com/elastic/elasticsearch/issues/53798

Does this cover your use case?

Yes, it sounds like a similar use case, except that we are not using ML. I saw it was tagged as an enhancement. Do you think it will become a feature in a future release?

Yes, I think we will eventually add this. However, I can't say when.

Okay, thank you so much.
