Rollup Jobs Performance

I'm having issues with the performance of Rollup Jobs on my system. The data volume is quite large:

Job(1): roughly 2.5M docs per 15 minutes / 300 GB per day
Job(2): roughly 250K docs per 15 minutes / 40 GB per day

I'm trying to roll up the data on a daily basis (24h). The max page size for the rollup job is 1000.

The indexing is taking forever, and we're actually falling behind rather than catching up.

Below is an example of our job:

PUT _rollup/job/my-system-rollup
{
  "id": "my-system-rollup",
  "index_pattern": "my-system",
  "rollup_index": "my-system-stats",
  "cron": "0 30 */1 * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": {
      "interval": "24h",
      "delay": "6h",
      "field": "@timestamp"
    },
    "terms": {
      "fields": [
        "field1",
        "field2",
        ...
        "field13"
      ]
    }
  }
}

Any ideas on how I could enhance the performance of such a job?

How can I find out which query is executed by the rollup job? I have seen in the job's stats that the search_time is very high.
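
One way to look into this, as a rough sketch: the job's counters can be read with the get rollup jobs API, and the search cost can be approximated by running a comparable composite aggregation by hand. The rollup indexer pages through the source index with a composite aggregation, but the exact request it builds is not shown in this thread, so the second request is only an approximation. The aggregation name rollup_like and the shortened two-field list are assumptions for illustration.

```console
# Returns the job config, status, and stats (search_time_in_ms,
# index_time_in_ms, processing_time_in_ms, pages_processed, ...).
GET _rollup/job/my-system-rollup

# Approximation of one rollup page: a composite aggregation over the
# date histogram plus the terms fields. "rollup_like" is a made-up name,
# and only two of the thirteen terms fields are listed here.
GET my-system/_search
{
  "size": 0,
  "aggs": {
    "rollup_like": {
      "composite": {
        "size": 1000,
        "sources": [
          { "day": { "date_histogram": { "field": "@timestamp", "fixed_interval": "24h" } } },
          { "field1": { "terms": { "field": "field1" } } },
          { "field2": { "terms": { "field": "field2" } } }
        ]
      }
    }
  }
}
```

The took value in the search response gives a rough idea of how long a single page takes. Adding the terms sources back one at a time may help show which fields make the search expensive.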

A few questions that might be able to point you in the right direction:

  1. What version of Elasticsearch are you using?
    • Many of the newer Elasticsearch releases have received significant improvements, so upgrading could generally be a "win".
    • If you're using Elasticsearch 8.5+, it is now recommended to use downsampling rather than rollup.
  2. What do your cluster metrics look like?
    • Do you have high disk latency maybe limiting how many records you can process?
    • Do you have high disk IO, also potentially limiting how many records you can process?
  3. What type of storage are you using for your cluster?
  4. Have you tried increasing the page_size setting?
    • Per the docs, increasing this should improve performance, but will also increase the amount of RAM used.
  5. Your example appears to show you're grouping by ~13 terms. Do many of these terms have high cardinality?
    • Per the docs, multiple high cardinality terms can have performance implications for rollups.
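
For question 5, a quick way to check is a cardinality aggregation per terms field. This is a minimal sketch, assuming the field names from the job above and a one-day time window chosen only for illustration; the cardinality aggregation returns approximate counts.

```console
# Approximate number of distinct values per field over the previous day.
# The aggregation names and the time range are assumptions; repeat the
# pattern for the remaining terms fields of the job.
GET my-system/_search
{
  "size": 0,
  "query": {
    "range": { "@timestamp": { "gte": "now-1d/d", "lt": "now/d" } }
  },
  "aggs": {
    "field1_cardinality": { "cardinality": { "field": "field1" } },
    "field2_cardinality": { "cardinality": { "field": "field2" } }
  }
}
```

The number of daily buckets the job can produce is bounded by the number of distinct field combinations that actually occur, so a few fields with very large counts can dominate the work.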

We currently use Elasticsearch version 8.3.3.

If I understand it correctly, downsampling is made for time series data, but unfortunately we don't have time series data.

According to our storage provider's dashboard, there is currently no bottleneck; at least, we do not see any problem at the moment.

We use Linstor (DRBD block storage), with SSDs for hot nodes and HDDs for warm nodes.

Yes, we tried changing the page_size to 10000, but we didn't see any real performance improvement.

Oh, I obviously overlooked that. This could already be the problem: of the 13 terms, 5 have high cardinality.

Have you looked into using transforms instead of rollup? I can't say for certain, but rollup has been in tech preview for a while, whereas transforms have made it to GA. With downsampling possibly replacing rollup v1, transforms might be a better solution.
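
As a rough sketch of what the equivalent might look like, here is a continuous pivot transform that mirrors the grouping of the rollup job above. The transform id my-system-daily, the destination index my-system-stats-transform, the doc_count aggregation, and the frequency and max_page_search_size values are all assumptions for illustration, not settings taken from this thread.

```console
# Hypothetical transform mirroring the rollup job's grouping.
# Only two of the thirteen terms fields are listed here.
PUT _transform/my-system-daily
{
  "source": { "index": "my-system" },
  "dest": { "index": "my-system-stats-transform" },
  "frequency": "1h",
  "sync": {
    "time": { "field": "@timestamp", "delay": "6h" }
  },
  "pivot": {
    "group_by": {
      "day": { "date_histogram": { "field": "@timestamp", "fixed_interval": "24h" } },
      "field1": { "terms": { "field": "field1" } },
      "field2": { "terms": { "field": "field2" } }
    },
    "aggregations": {
      "doc_count": { "value_count": { "field": "@timestamp" } }
    }
  },
  "settings": { "max_page_search_size": 1000 }
}

# Start the transform, then check its progress and timings.
POST _transform/my-system-daily/_start
GET _transform/my-system-daily/_stats
```

The same high-cardinality caveat applies here, since a pivot transform also has to page through every combination of the group_by fields.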

Thanks for the hint. I looked at it and it actually fits better for us because it is more flexible and also covers everything we need. We will develop the rollup job with this feature.
