Impact of frequency value for continuous transform

Hi,

I created a continuous transform job to run once a day and calculate aggregations about communications by IP and protocol/port.

This is the skeleton of my transform:

{
  "source": {
    "index": [
      "my_source_index"
    ],
    "query": {
    }
  },
  "dest": {
    "index": "my_dest_index"
  },
  "frequency": "1h",
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "source.ip": {
        "terms": {
          "field": "source.ip"
        }
      },
      "destination.ip": {
        "terms": {
          "field": "destination.ip"
        }
      },
      "destination.port": {
        "terms": {
          "field": "destination.port"
        }
      },
      "network.protocol": {
        "terms": {
          "field": "network.protocol"
        }
      },
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "1d"
        }
      }
    },
    "aggregations": {
    }
  },
  "settings": {
    "max_page_search_size": 30000,
    "align_checkpoints": true
  }
}

In this configuration, with "fixed_interval": "1d", a new index is created the following day. For example, today, July 10th, I have my destination index 2024-07-09, and tomorrow I will get the destination index 2024-07-10.

The result suits me well because I don't need fresh data (the previous day is enough), and I get each key (source.ip, destination.ip, destination.port, network.protocol) only once in my index (the name of the destination index is set each day by an ingest pipeline).
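For context, a pipeline like this can be built with a date_index_name processor; the sketch below is simplified, and the pipeline name and index name prefix are placeholders rather than my exact configuration (the pipeline would be referenced from the transform's dest.pipeline or set as a default pipeline on the destination index template):

PUT _ingest/pipeline/daily-dest-index
{
  "description": "Route each document to a daily destination index based on @timestamp",
  "processors": [
    {
      "date_index_name": {
        "field": "@timestamp",
        "index_name_prefix": "my_dest_index-",
        "date_rounding": "d",
        "index_name_format": "yyyy-MM-dd"
      }
    }
  ]
}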

The source index collects a very large number of logs, about 600 million every 24 hours.

My only question is about the frequency. I set the frequency to 1 hour (the maximum value). Is that a good idea, or should I set the frequency to a lower value, like 5m or 1m?

I imagine that with a low frequency the processing would be spread more evenly over time and could be more efficient than a big load every hour.
Am I right?

Thanks.

Eric

I imagine that with a low frequency the processing would be spread more evenly over time and could be more efficient than a big load every hour.
Am I right?

You are correct.

Since it doesn't matter if the data is a day old, I think the real question is whether you want a spike of search traffic every hour or a more constant load every 5m.

With a frequency of 1h and assuming a flat rate of logs per hour, 6000000 / 24 ~= 250k docs per hour. With a max page search size of 30k, it will take 9 pages to process those docs. Each 30k page may cause a memory spike, and aggregating 30k docs may cause a large search load. Lowering the page size will lower the memory and load, but the transform will take longer to iterate through the docs. As long as the transform finishes its checkpoint within the hour (which seems likely), it won't fall behind. That might help flatten out the impact, if that's what you want.
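If you want to experiment with a smaller page size, it can be changed on the existing transform with the update API; for example (my_transform is a placeholder id, and 5000 is only an illustrative value, not a recommendation):

POST _transform/my_transform/_update
{
  "settings": {
    "max_page_search_size": 5000
  }
}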

Increasing the frequency (i.e. setting a smaller frequency value) would lower the number of docs to search over per checkpoint, which would also reduce the memory and load.
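Concretely, that would be something like the following (again, my_transform is a placeholder id; the change takes effect from the next checkpoint):

POST _transform/my_transform/_update
{
  "frequency": "5m"
}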

Hi,

Thank you, Patrick, for your explanation.
What you say is interesting!

I think you made a mistake in the calculation.
I said 600 million logs per day, so 600000000 / 24 ~= 25 million docs per hour.

Note that I filter logs in my query, so after query filtering I have approximately 150K logs each hour.

I tried to estimate the number of buckets after the composite aggregation (source.ip, destination.ip, destination.port, network.protocol): it is between 60K and 80K buckets per hour, and approximately 300K distinct buckets per day.
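For anyone curious, one way to get a rough estimate like this is an approximate cardinality aggregation over the concatenated key fields. This is only a sketch: it assumes all four fields are present in every document, and the transform's query filter plus a time range would still need to be added to the request.

GET my_source_index/_search
{
  "size": 0,
  "aggs": {
    "distinct_transform_keys": {
      "cardinality": {
        "script": {
          "lang": "painless",
          "source": "doc['source.ip'].value + '|' + doc['destination.ip'].value + '|' + doc['destination.port'].value + '|' + doc['network.protocol'].value"
        }
      }
    }
  }
}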

To sum up, for each hour: raw data = 25 million logs -> query filter = 150K logs -> aggregation = 80K buckets max.

In your explanation, you divide the number of logs by the max page search size to find the number of pages.

But I noticed in the Elastic documentation: "The max_page_search_size transform configuration option defines the number of buckets that are returned for each search request."

So do you think we should divide the number of logs or the number of buckets by max_page_search_size?

Regards,
Eric

It is the number of composite buckets.
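Using your numbers as a rough check: with at most ~80K composite buckets per hourly checkpoint and max_page_search_size at 30000, that is about

80000 / 30000 ~= 3 composite pages per checkpoint

so the paging cost per checkpoint stays small either way.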

Yep, I totally read it as 6 million, so add two zeros to the end of everything.