Is there an update coming soon for the transforms API?

I'm using the transforms API for a use case where I have to run a transform continuously. The limitations described here say that the maximum frequency I can set is every 1h, but I'm planning to run it at night over documents indexed in the last 24h. Is there an update coming soon, or is there any workaround, please?

Better scheduling is planned for a future release; however, I can't give you an exact date for when this is coming.

Are you using a date_histogram in your transform, configured with a 1d interval?

Starting with 7.15, a date_histogram will only process complete buckets by default; complete means in this case that 24 hours must have passed. The new setting is called align_checkpoints, and its default is true.

In other words: although the transform triggers e.g. every hour, it will not do any work and therefore not cause any harm. It is just one cheap query every hour; even if you set frequency to a lower value, you probably won't notice. Once the transform runs the next day, and once all data of the last day is available, it will create the bucket(s) for the last day.

Nevertheless, we have plans for better scheduling, e.g. to let you specify exactly when a transform should run.
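For illustration, a minimal sketch of a continuous transform that groups on a daily date_histogram. The transform name, index names, and field names are placeholders, and align_checkpoints is spelled out even though true is already the default from 7.15 on:

```json
PUT _transform/daily_transform
{
  "source": { "index": "source_index" },
  "dest": { "index": "dest_index" },
  "frequency": "1h",
  "sync": {
    "time": { "field": "@timestamp", "delay": "60s" }
  },
  "pivot": {
    "group_by": {
      "day": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1d"
        }
      }
    },
    "aggregations": {
      "doc_count": { "value_count": { "field": "@timestamp" } }
    }
  },
  "settings": { "align_checkpoints": true }
}
```

With align_checkpoints enabled, the hourly checks are cheap no-ops until a full day's bucket is complete.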


Thank you for your reply, that's interesting! So what I understand for 7.15 is that, in order for my transform to process documents ingested in the last 24 hours, I have to use a date_histogram so that it creates a bucket and then executes the transform. But my use case doesn't use any date_histogram; instead I use a pivot with a group_by and a scripted_metric aggregation. Here is what my transform looks like:

POST _transform/_preview
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index"
  },
  "pivot": {
    "group_by": {
      "id": {
        "terms": {
          "field": "id"
        }
      }
    },
    "aggregations": {
      "latest_ts": {
        "scripted_metric": {
          "init_script": "state.timestamp_latest = 0L;",
          "map_script": """
          def current_date = doc['@timestamp'].getValue().toInstant().toEpochMilli();
          if (current_date > state.timestamp_latest && doc['host.name'].getValue() =~ /hostname2/)
          {state.timestamp_latest = current_date;}
        """,
          "combine_script": "return state",
          "reduce_script": """
          def last_doc = '';
          def timestamp_latest = 0L;
          for (s in states) {if (s.timestamp_latest > (timestamp_latest))
          {timestamp_latest = s.timestamp_latest;}}
          return timestamp_latest
        """
        }
      },
      "first_ts": {
        "scripted_metric": {
          "init_script": "state.timestamp_first = 999999999999999L;",
          "map_script": """
          def current_date = doc['@timestamp'].getValue().toInstant().toEpochMilli();
          if (current_date < state.timestamp_first && doc['host.name'].getValue() =~ /hostname1/)
          {state.timestamp_first = current_date;}
        """,
          "combine_script": "return state",
          "reduce_script": """
          def last_doc = '';
          def timestamp_first = 999999999999999L;
          for (s in states) {if (s.timestamp_first < (timestamp_first))
          {timestamp_first = s.timestamp_first;}}
          return timestamp_first
        """
        }
      },
      "time_length": {
        "bucket_script": {
          "buckets_path": {
            "min": "first_ts.value",
            "max": "latest_ts.value"
          },
          "script": """
          (params.max - params.min)/1000
           
          """
        }
      }
    }
  }
}

Is it possible to include a date_histogram in my transform so that it can be optimized?

No, you don't have to use a date_histogram. I assumed you were using one because you mentioned the 24 hour requirement. A date_histogram in addition to your terms group_by would allow you to look back in history.

Regarding your configuration:

latest_ts and first_ts return the latest and first timestamps. I wonder why you are not using min and max; they should work for calculating time_length, and the docs contain a similar example.

I suggest trying this out. An additional date_histogram will create more buckets, and therefore more documents. As said above, if you do this in addition you preserve historic values, which might be useful. Note that the destination index will then contain several entries per id; to see the latest one you can e.g. use a top_hits aggregation.
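The group_by from the config above, extended with a date_histogram as suggested; the bucket name day and the @timestamp field are assumptions based on the fields your scripts already use:

```json
"group_by": {
  "id": {
    "terms": { "field": "id" }
  },
  "day": {
    "date_histogram": {
      "field": "@timestamp",
      "calendar_interval": "1d"
    }
  }
}
```

With this, the transform writes one document per id per day instead of one per id, which is what preserves the history.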


Thank you! Well, I didn't use max and min because I had to select first_ts and latest_ts based on the value of the host.name field; that's why I had to do it as above ^^

Got it. It might be possible to use a filter aggregation with min/max as sub-aggregations.
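As a sketch of that idea, the two scripted_metric aggregations could become filter aggregations, each with a min or max sub-aggregation. This assumes host.name matches the exact terms hostname1 and hostname2 (the original used a regex, so a wildcard or regexp query may be needed instead), and the `first_ts>ts` buckets_path syntax for reaching a sub-aggregation is worth verifying with a _preview:

```json
POST _transform/_preview
{
  "source": { "index": "source_index" },
  "dest": { "index": "dest_index" },
  "pivot": {
    "group_by": {
      "id": { "terms": { "field": "id" } }
    },
    "aggregations": {
      "first_ts": {
        "filter": { "term": { "host.name": "hostname1" } },
        "aggs": {
          "ts": { "min": { "field": "@timestamp" } }
        }
      },
      "latest_ts": {
        "filter": { "term": { "host.name": "hostname2" } },
        "aggs": {
          "ts": { "max": { "field": "@timestamp" } }
        }
      },
      "time_length": {
        "bucket_script": {
          "buckets_path": {
            "min": "first_ts>ts",
            "max": "latest_ts>ts"
          },
          "script": "(params.max - params.min) / 1000"
        }
      }
    }
  }
}
```

This avoids the scripted_metric entirely, which is generally cheaper and easier to maintain.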
