Hi,
I created a continuous transform job to compute, once a day, aggregations about communications between IPs and protocol/port pairs.
This is the skeleton of my transform:
{
  "source": {
    "index": [
      "my_source_index"
    ],
    "query": {}
  },
  "dest": {
    "index": "my_dest_index"
  },
  "frequency": "1h",
  "sync": {
    "time": {
      "field": "@timestamp",
      "delay": "60s"
    }
  },
  "pivot": {
    "group_by": {
      "source.ip": {
        "terms": {
          "field": "source.ip"
        }
      },
      "destination.ip": {
        "terms": {
          "field": "destination.ip"
        }
      },
      "destination.port": {
        "terms": {
          "field": "destination.port"
        }
      },
      "network.protocol": {
        "terms": {
          "field": "network.protocol"
        }
      },
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "fixed_interval": "1d"
        }
      }
    },
    "aggregations": {}
  },
  "settings": {
    "max_page_search_size": 30000,
    "align_checkpoints": true
  }
}
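(For context, this is roughly how I register and start it; the transform id my_daily_ip_transform below is just a placeholder, and the body is the skeleton above.)

PUT _transform/my_daily_ip_transform
{
  ... the skeleton above ...
}

POST _transform/my_daily_ip_transform/_start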
With "fixed_interval": "1d" in this configuration, a new index is created the following day. For example, today, July 10th, I have my destination index 2024-07-09, and tomorrow I will get the destination index 2024-07-10.
The result suits me well because I don't need fresh data (the previous day is enough), and so each key (source.ip, destination.ip, destination.port, network.protocol) appears only once in my index (the name of the destination index is set each day by an ingest pipeline).
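(For reference, the pipeline does something along the lines of this sketch, built on the date_index_name processor; the pipeline name and index prefix are placeholders, not my exact configuration.)

PUT _ingest/pipeline/daily-dest-index
{
  "processors": [
    {
      "date_index_name": {
        "field": "@timestamp",
        "index_name_prefix": "my_dest_index-",
        "date_rounding": "d",
        "date_formats": ["ISO8601"],
        "index_name_format": "yyyy-MM-dd"
      }
    }
  ]
}

A pipeline like this can be attached either as the destination index's default pipeline or directly on the transform via the optional "pipeline" setting inside "dest".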
The source index collects a very large number of logs, about 600 million documents every 24 hours.
I only have a question about the frequency. I set it to 1 hour (the maximum value), but is that a good idea?
Should I set the frequency to a lower value, like 5m or 1m?
I can imagine that with a lower frequency the processing would be spread more evenly over time and could be more efficient than one big load every hour.
Am I right?
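If lowering it is the better option, I assume I would change it with the transform update API, along these lines (again with the placeholder transform id):

POST _transform/my_daily_ip_transform/_update
{
  "frequency": "5m"
}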
Thanks.
Eric