How is an 'avg' aggregation updated in transforms when old records are deleted?

I am creating a transformed index from a source index. (https://www.elastic.co/guide/en/elasticsearch/reference/master/transform-overview.html)
Logs older than 5 days are deleted from the source index. When computing an 'avg' metric aggregation, will the deleted logs be excluded from the computed average? Below is the code skeleton for reference.

POST _transform/_preview
{
  "source": {
    "index": "src_index",   
  },
  "dest": {
    "index": "transform_test"
  },
  "pivot": {
    "group_by": {
      "pivot_name": { "terms": {
        "field": "pivot_term"
      }}     
    },
    "aggregations": {
      "avg_val": { "avg": {
        "field": "field_name"
      }}      
    }
  }
}

The transform you posted is a so-called batch transform; it runs only once and calculates the average based on the data available at that time.

However, I assume you plan to turn this into a continuous transform. For a continuous transform, the average is calculated at the time the data is retrieved. If you delete data, the bucket/document in the destination index keeps its value. If the bucket is recalculated, the average is recalculated as well and therefore changes to reflect the deleted data, too.

In other words: if your pivot_term is a unique id, this isn't a problem, because the bucket would not be recalculated. If it isn't, all aggregations are recalculated on whatever data is still available. In use cases like yours it is useful to add min and max fields so you know when a bucket was last recalculated and what its earliest data point is. That way you can also filter out old buckets when you search the destination index.
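A minimal sketch of what that could look like as a continuous transform with the extra min/max aggregations. This assumes the source documents carry a timestamp field (called @timestamp here); the transform id, delay and frequency are made up as well:

PUT _transform/avg_per_pivot_term
{
  "source": { "index": "src_index" },
  "dest": { "index": "transform_test" },
  "sync": {
    "time": {
      "field": "@timestamp",   // assumed timestamp field, adjust to your mapping
      "delay": "60s"
    }
  },
  "frequency": "5m",
  "pivot": {
    "group_by": {
      "pivot_name": { "terms": { "field": "pivot_term" } }
    },
    "aggregations": {
      "avg_val": { "avg": { "field": "field_name" } },
      "earliest_doc": { "min": { "field": "@timestamp" } },
      "latest_doc": { "max": { "field": "@timestamp" } }
    }
  }
}

earliest_doc and latest_doc then tell you, per bucket, how far back its data reaches and up to which point data was included the last time it was recalculated.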

Your use case looks like a data compaction one; Rollup might be more suitable.
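For comparison, a rough sketch of a rollup job under the same assumptions (the job name, rollup index, timestamp field and intervals are all made up). Note that a rollup job always requires a date_histogram group, so it produces one rolled-up document per pivot_term per time bucket:

PUT _rollup/job/compact_src_index
{
  "index_pattern": "src_index",
  "rollup_index": "src_index_rollup",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "field": "@timestamp",
      "fixed_interval": "1h"
    },
    "terms": { "fields": [ "pivot_term" ] }
  },
  "metrics": [
    { "field": "field_name", "metrics": [ "avg", "min", "max", "value_count" ] }
  ]
}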

I hope this helps!

(We are looking into further transform improvements, so it's very useful for us to hear about use cases like this.)


That answers my question very clearly. I was indeed planning on doing a continuous transform, though it was not apparent from my question or code.

While data compaction is what I am trying to achieve, Rollup by itself will probably create multiple records for a unique pivot_term (one per time bucket). I would want a single record per unique pivot_term. I believe I can achieve that with some post-processing on the Rollup.
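For example, that post-processing could be as simple as re-aggregating per pivot_term at search time through the Rollup Search API. This is only a sketch against a hypothetical rollup index called src_index_rollup, and only fields configured in the rollup job can be used here:

GET src_index_rollup/_rollup_search
{
  "size": 0,
  "aggregations": {
    "per_pivot_term": {
      "terms": { "field": "pivot_term" },
      "aggregations": {
        "avg_val": { "avg": { "field": "field_name" } }
      }
    }
  }
}

That returns a single avg per pivot_term across all the rollup time buckets, at the cost of doing the final aggregation at query time rather than materialising a single document.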
