Impact of frequency value for continuous transform

Hi,

I created a continuous transform job to run once a day and calculate aggregations about communications by IP and protocol/port.

This is the skeleton of my transform:

{
  "source": {
    "index": [
      "my_source_index"
    ],
    "query": {
    }
  },
  "dest": {
    "index": "my_dest_index"
  },
  "frequency": "1h",
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "source.ip": {
        "terms": {
          "field": "source.ip"
        }
      },
      "destination.ip": {
        "terms": {
          "field": "destination.ip"
        }
      },
      "destination.port": {
        "terms": {
          "field": "destination.port"
        }
      },
      "network.protocol": {
        "terms": {
          "field": "network.protocol"
        }
      },
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "1d"
        }
      }
    },
    "aggregations": {
    }
  },
  "settings": {
    "max_page_search_size": 30000,
    "align_checkpoints": true
  }
}

In this configuration, with "fixed_interval": "1d", a new index is created the following day. For example, today, July 10th, I have my destination index 2024-07-09, and tomorrow I will get the destination index 2024-07-10.

The result suits me well because I don't need fresh data (the previous day is enough), and I get each key (source.ip, destination.ip, destination.port, network.protocol) only once in my index (the name of the destination index is set each day by an ingest pipeline).
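For context, a pipeline like this can be built with a date_index_name processor; the sketch below is simplified, and the pipeline name and index name prefix are placeholders rather than my exact configuration (the pipeline would be referenced from the transform's dest.pipeline or set as a default pipeline on the destination index template):

PUT _ingest/pipeline/daily-dest-index
{
  "description": "Route each document to a daily destination index based on @timestamp",
  "processors": [
    {
      "date_index_name": {
        "field": "@timestamp",
        "index_name_prefix": "my_dest_index-",
        "date_rounding": "d",
        "index_name_format": "yyyy-MM-dd"
      }
    }
  ]
}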

The source index collects a very large number of logs, about 600 million every 24 hours.

My only question is about the frequency. I set the frequency to 1 hour (the maximum value). Is that a good idea, or should I set the frequency to a lower value, like 5m or 1m?

I imagine that with a low frequency the processing would be spread more evenly over time and could be more efficient than a big load every hour.
Am I right?

Thanks.

Eric

I imagine that with a low frequency the processing would be spread more evenly over time and could be more efficient than a big load every hour.
Am I right?

You are correct.

Since it doesn't matter if the data is a day old, I think the real question is whether you want a spike of search traffic every hour or a more constant load every 5m.

With a frequency of 1h and assuming a flat rate of logs per hour, 6000000 / 24 ~= 250k docs per hour. With a max page search size of 30k, it will take 9 pages to process those docs. Each 30k page may cause a memory spike, and aggregating 30k docs may cause a large search load. Lowering the page size will lower the memory and load, but the transform will take longer to iterate through the docs. As long as the transform finishes its checkpoint within the hour (which seems likely), it won't fall behind. That might help flatten out the impact, if that's what you want.
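If you want to experiment with a smaller page size, it can be changed on the existing transform with the update API; for example (my_transform is a placeholder id, and 5000 is only an illustrative value, not a recommendation):

POST _transform/my_transform/_update
{
  "settings": {
    "max_page_search_size": 5000
  }
}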

Increasing the frequency (i.e. setting a smaller frequency value) would lower the number of docs to search over per checkpoint, which would also reduce the memory and load.
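Concretely, that would be something like the following (again, my_transform is a placeholder id; the change takes effect from the next checkpoint):

POST _transform/my_transform/_update
{
  "frequency": "5m"
}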

Hi,

Thank you, Patrick, for your explanation.
What you say is interesting!

I think you made a mistake in the calculation.
I said 600 million logs per day, so 600000000 / 24 ~= 25 million docs per hour.

Note that I filter logs in my query, so after query filtering I have approximately 150K logs each hour.

I tried to estimate the number of buckets after the composite aggregation (source.ip, destination.ip, destination.port, network.protocol): it is between 60K and 80K buckets per hour, and approximately 300K distinct buckets per day.
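For anyone curious, one way to get a rough estimate like this is an approximate cardinality aggregation over the concatenated key fields. This is only a sketch: it assumes all four fields are present in every document, and the transform's query filter plus a time range would still need to be added to the request.

GET my_source_index/_search
{
  "size": 0,
  "aggs": {
    "distinct_transform_keys": {
      "cardinality": {
        "script": {
          "lang": "painless",
          "source": "doc['source.ip'].value + '|' + doc['destination.ip'].value + '|' + doc['destination.port'].value + '|' + doc['network.protocol'].value"
        }
      }
    }
  }
}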

To sum up, for each hour: raw data = 25 million logs -> query filter = 150K logs -> aggregation = 80K buckets max.

In your explanation, you divide the number of logs by the max page search size to find the number of pages.

But I noticed in the Elastic documentation: "The max_page_search_size transform configuration option defines the number of buckets that are returned for each search request."

So do you think we should divide the number of logs or the number of buckets by max_page_search_size?

Regards,
Eric

It is the number of composite buckets.
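Using your numbers as a rough check: with at most ~80K composite buckets per hourly checkpoint and max_page_search_size at 30000, that is about

80000 / 30000 ~= 3 composite pages per checkpoint

so the paging cost per checkpoint stays small either way.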

Yep, I totally read it as 6 million, so add two zeros to the end of everything.