Is there an update coming soon for the transforms API?

I'm using the transforms API for a use case where I have to run it continuously. The limitations described here state that the maximum frequency I can set is every 1h, but I'm planning to run it at night on documents indexed in the last 24 hours. Is there an update coming soon, or is there any workaround, please?

Better scheduling is planned for a future release; however, I can't give you an exact date for when this is coming.

Are you using a date_histogram in your transform, configured with a 1d interval?

Starting with 7.15, a date_histogram will only process complete buckets by default; complete means in this case that the full 24 hours must have passed. The new setting is called align_checkpoints and it defaults to true.

In other words: although the transform triggers e.g. every hour, it will not do any work and therefore will not cause any harm. It is just one cheap query every hour, so even if you set frequency to a lower value you probably won't notice. Once the transform runs the next day, after all data of the previous day is available, it will create the bucket(s) for that day.
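To make this concrete, here is a minimal sketch of such a continuous transform, assuming a source_index with an @timestamp field (the index, field, and transform names are placeholders). It groups by a date_histogram with a 1d calendar interval, and align_checkpoints is shown explicitly even though true is already the default:

PUT _transform/daily_example
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index"
  },
  "sync": {
    "time": {
      "field": "@timestamp"
    }
  },
  "frequency": "1h",
  "pivot": {
    "group_by": {
      "day": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1d"
        }
      }
    },
    "aggregations": {
      "docs_per_day": {
        "value_count": {
          "field": "@timestamp"
        }
      }
    }
  },
  "settings": {
    "align_checkpoints": true
  }
}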

Nevertheless, we have plans for better scheduling, e.g. being able to specify exactly when to run a transform.


Thank you for your reply, that's interesting! So what I understand is that with 7.15, in order for my transform to process documents ingested in the last 24 hours, I have to use a date_histogram so it creates a bucket for them and then executes the transform. But my use case doesn't use any date_histogram; instead I use a pivot with a group_by and scripted_metric aggregations. Here is what my transform looks like:

POST _transform/_preview
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index"
  },
  "pivot": {
    "group_by": {
      "id": {
        "terms": {
          "field": "id"
        }
      }
    },
    "aggregations": {
      "latest_ts": {
        "scripted_metric": {
          "init_script": "state.timestamp_latest = 0L;",
          "map_script": """
          def current_date = doc['@timestamp'].getValue().toInstant().toEpochMilli();
          if (current_date > state.timestamp_latest && doc['host.name'].getValue() =~ /hostname2/)
          {state.timestamp_latest = current_date;}
        """,
          "combine_script": "return state",
          "reduce_script": """
          def timestamp_latest = 0L;
          for (s in states) {if (s.timestamp_latest > (timestamp_latest))
          {timestamp_latest = s.timestamp_latest;}}
          return timestamp_latest
        """
        }
      },
      "first_ts": {
        "scripted_metric": {
          "init_script": "state.timestamp_first = 999999999999999L;",
          "map_script": """
          def current_date = doc['@timestamp'].getValue().toInstant().toEpochMilli();
          if (current_date < state.timestamp_first && doc['host.name'].getValue() =~ /hostname1/)
          {state.timestamp_first = current_date;}
        """,
          "combine_script": "return state",
          "reduce_script": """
          def timestamp_first = 999999999999999L;
          for (s in states) {if (s.timestamp_first < (timestamp_first))
          {timestamp_first = s.timestamp_first;}}
          return timestamp_first
        """
        }
      },
      "time_length": {
        "bucket_script": {
          "buckets_path": {
            "min": "first_ts.value",
            "max": "latest_ts.value"
          },
          "script": """
          (params.max - params.min)/1000
           
          """
        }
      }
    }
  }
}

Is it possible to include a date_histogram in my transform so that it is optimized?

No, you don't have to use a date_histogram. I assumed you were using one because you mentioned the 24 hour requirement. A date_histogram in addition to your terms group_by would allow you to look back in history.
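For illustration, a sketch of what the group_by could look like with a date_histogram added next to your existing terms group. The aggregation part is simplified to a plain max here; your scripted_metric aggregations would stay as they are:

POST _transform/_preview
{
  "source": {
    "index": "source_index"
  },
  "pivot": {
    "group_by": {
      "id": {
        "terms": {
          "field": "id"
        }
      },
      "day": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1d"
        }
      }
    },
    "aggregations": {
      "latest_ts": {
        "max": {
          "field": "@timestamp"
        }
      }
    }
  }
}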

Regarding your configuration:

latest_ts and first_ts return the latest and first timestamp. I wonder why you are not using min and max; these should work for calculating time_length, and the docs contain a similar example.

I suggest trying this out. An additional date_histogram will create more buckets, and therefore more documents. As said above, if you add it you also preserve historic values, which might be useful. Note that the destination index will then contain several entries per id; to see the latest one you can e.g. use a top_hits aggregation.
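For example, to read only the most recent entry per id from the destination index, a query along these lines could work. It assumes the transform writes the date_histogram bucket into a field called day and that id is mapped as a keyword (both are assumptions on my side):

GET dest_index/_search
{
  "size": 0,
  "aggs": {
    "by_id": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "day": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}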


Thank you! Well, I didn't use max and min because I had to select first_ts and latest_ts based on the value of the host.name field; that's why I had to do it as above ^^

Got it. It might be possible to use the filter aggregation with min/max as a sub-aggregation.
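A rough sketch of that idea, keeping your terms group_by and replacing the scripted_metric aggregations with filter aggregations that wrap min/max. The term queries on host.name are an assumption on my side; your original scripts match with a regex, so you may need a different query, and it's worth verifying how the nested results end up flattened in the destination documents:

POST _transform/_preview
{
  "source": {
    "index": "source_index"
  },
  "pivot": {
    "group_by": {
      "id": {
        "terms": {
          "field": "id"
        }
      }
    },
    "aggregations": {
      "first_ts": {
        "filter": {
          "term": {
            "host.name": "hostname1"
          }
        },
        "aggs": {
          "value": {
            "min": {
              "field": "@timestamp"
            }
          }
        }
      },
      "latest_ts": {
        "filter": {
          "term": {
            "host.name": "hostname2"
          }
        },
        "aggs": {
          "value": {
            "max": {
              "field": "@timestamp"
            }
          }
        }
      }
    }
  }
}

If the filter aggregation is accepted here, the time_length bucket_script would need its buckets_path entries adjusted to point at the nested values (e.g. first_ts>value), which I haven't verified.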

