Can an Elasticsearch rollup job dynamically create indexes like Logstash does?

I am currently testing out the new rollup APIs in Elasticsearch 6.3 and am wondering if there is any way to configure a rollup job to dynamically create an index based on timestamp, like Logstash does when ingesting data? The use case is rolling up large amounts of time-series network performance reporting data. I'm worried that even an hourly rollup will create a huge index to manage, so I'm looking to split it so that there is one index for each day's hourly rollups.

Current rollup job config:

{
    "index_pattern": "dxs-raw-*",
    "rollup_index": "dxs-hourly-%{+YYYY.MM.dd}",
    "cron": "* */15 * * * ?",
    "page_size": 1000,
    "groups": {
        "date_histogram": {
            "field": "@timestamp",
            "interval": "1h",
            "delay": "12h"
        },
        "terms": {
            "fields": ["ci_id.keyword", "client_id.keyword", "element_name.keyword", "measurement.keyword", "source_management_platform.keyword", "unit.keyword"]
        }
    },
    "metrics": [
        {
            "field": "value",
            "metrics": ["min", "max", "avg"]
        }
    ]
}

Error seen when PUTting the job via the Kibana Dev Tools console:

{
    "error": {
        "root_cause": [
            {
                "type": "invalid_index_name_exception",
                "reason": "Invalid index name [dxs-hourly-%{+YYYY.MM.dd}], must be lowercase",
                "index_uuid": "_na_",
                "index": "dxs-hourly-%{+YYYY.MM.dd}"
            }
        ],
        "type": "runtime_exception",
        "reason": "runtime_exception: Could not create index for rollup job [dxs-hourly]",
        "caused_by": {
            "type": "invalid_index_name_exception",
            "reason": "Invalid index name [dxs-hourly-%{+YYYY.MM.dd}], must be lowercase",
            "index_uuid": "_na_",
            "index": "dxs-hourly-%{+YYYY.MM.dd}"
        }
    },
    "status": 500
}

At the moment, no... there's no way to do that. Right now we're sort of waiting on the Index Lifecycle Management feature to become available, since that will make this sort of management a lot easier than baking parts of the logic into the Rollup feature. But that's still a work in progress.

It might be possible to use the Rollover API today, but it's entirely untested so I'm not sure. E.g. create a write alias, set up some rollover rules, then point the Rollup config at the alias.

The tricky bit is that Rollup's disaster recovery mechanism (like if a node goes down) is to simply backtrack to the last checkpoint and overwrite any documents that were written after the checkpoint. So you'd want to make sure that doesn't happen around a rollover event. A procedure like the following (sketched in console form after the list):

  1. Stop the rollup job, wait for it to finish and checkpoint
  2. Hit the Rollover API to check conditions
  3. Restart the rollup job
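
Purely as an untested sketch of that sequence (the job name, alias, and rollover condition below are just placeholders):

POST _xpack/rollup/job/dxs-hourly/_stop

GET _xpack/rollup/job/dxs-hourly

POST /dxs-hourly-rollups/_rollover
{
    "conditions": {
        "max_age": "1d"
    }
}

POST _xpack/rollup/job/dxs-hourly/_start

The GET is just there to confirm the job status reports it as stopped before triggering the rollover.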

The other alternative is to pin each job to a specific source index and create lots of jobs, each one rolling up just that single index.
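
For example (the job name, index names, and cron here are purely illustrative, reusing the groups/metrics from your config above), a per-day job pinned to one source index might look like:

PUT _xpack/rollup/job/dxs-hourly-2018.08.01
{
    "index_pattern": "dxs-raw-2018.08.01",
    "rollup_index": "dxs-hourly-2018.08.01",
    "cron": "0 0 * * * ?",
    "page_size": 1000,
    "groups": {
        "date_histogram": {
            "field": "@timestamp",
            "interval": "1h",
            "delay": "12h"
        },
        "terms": {
            "fields": ["ci_id.keyword", "client_id.keyword", "element_name.keyword", "measurement.keyword", "source_management_platform.keyword", "unit.keyword"]
        }
    },
    "metrics": [
        {
            "field": "value",
            "metrics": ["min", "max", "avg"]
        }
    ]
}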

Note: Rollup right now doesn't allow searching across multiple rollup indices (technical limitations internally), so you'd be stuck searching just the single hourly rollup index which is highly non-ideal :confused:

How many rollup documents is your job generating for each hour, or other timeframe?

Thanks Zachary, I will try the alias idea out and report back :slight_smile:.

The conservative estimate we have is ~1 million metrics if it was a single hourly index. Given our document size this equates to about 160GB of data. If what I'm building manages to get high adoption that number could be 3-4x higher. Our "raw" data is stored in daily indexes of ~72,000,000 documents in size (~12GB).

We are essentially rolling up progressively so once we get past hourly the numbers (using the same numbers as above) would look like:
Daily = 90,000,000 / 13.41GB
Weekly = ~20,000,000 / 2.87GB
Monthly = 6,000,000 / 0.89GB
Quarterly = 3,000,000 / 0.45GB
Yearly = 1,000,000 / 0.15GB

Hey Zachary, I did try the alias and rollover idea out, but still no luck I'm afraid. Here is what I tried as a test, rolling up every ten minutes into hourly rollover indexes:

Created a new index with an alias (the URL-encoded name below decodes to <dxs-rolled-hourly-{now/H-1H{YYYY.MM.dd.HH-}}>):

PUT /%3Cdxs-rolled-hourly-%7Bnow%2FH-1H%7BYYYY.MM.dd.HH-%7D%7D%3E
{
    "aliases": {
        "dxs-rolled-hourly": {}
    }
}

Created a new rollup job targeting the alias:

PUT _xpack/rollup/job/dxs-hourly
{
    "index_pattern": "dxs-raw-*",
    "rollup_index": "dxs-rolled-hourly",
    "cron": "*/5 * * * * ?",
    "page_size": 1000,
    "groups": {
        "date_histogram": {
            "field": "@timestamp",
            "interval": "10m",
            "delay": "30m"
        },
        "terms": {
            "fields": ["ci_id.keyword", "client_id.keyword", "element_name.keyword", "measurement.keyword", "source_management_platform.keyword", "unit.keyword"]
        }
    },
    "metrics": [
        {
            "field": "value",
            "metrics": ["min", "max", "avg"]
        }
    ]
}

Received the following error:

{
  "error": {
    "root_cause": [
      {
        "type": "invalid_index_name_exception",
        "reason": "Invalid index name [dxs-rolled-hourly], already exists as alias",
        "index_uuid": "_na_",
        "index": "dxs-rolled-hourly"
      }
    ],
    "type": "runtime_exception",
    "reason": "runtime_exception: Could not create index for rollup job [dxs-hourly]",
    "caused_by": {
      "type": "invalid_index_name_exception",
      "reason": "Invalid index name [dxs-rolled-hourly], already exists as alias",
      "index_uuid": "_na_",
      "index": "dxs-rolled-hourly"
    }
  },
  "status": 500
}

So to me it looks like the rollup index is being created rather than referenced (hence the error when I used the same name as the alias).

Since it looks like this isn't possible today, what would you recommend as the approach to get this raised as a potential enhancement?

Ah, right, I forgot about that. When you create a job, it first tries to create the destination rollup index. If it exists, it updates the existing rollup metadata to include a new job.

But if it's an alias, it'll just throw an exception like you saw. I think that should be fixable. I'll open a ticket in a few minutes with this enhancement request; there are a few routes we could take (a fix for aliases, support for rollover internally, ILM, etc.).

The conservative estimate we have is ~1 million metrics if it was a single hourly index. Given our document size this equates to about 160GB of data. If what I'm building manages to get high adoption that number could be 3-4x higher. Our "raw" data is stored in daily indexes of ~72,000,000 documents in size (~12GB).

Sorry, a bit confused here. Are you saying one hour of rolled up data generates 160gb, but the raw data of 72m docs is only 12gb? Ditto for some of the other numbers, like 90m/13.41GB in rollup compared to 72m raw?

I think I may just be misunderstanding something here, sorry!

FYI, I opened an issue to track this: https://github.com/elastic/elasticsearch/issues/33065


Thanks Zachary! Sorry, the way I phrased the sizing isn't very clear; here is what we are trying to do:

  1. Collect performance metrics for devices every 5 minutes and store them in an index for each day. The estimate is a document size of about 160 bytes, 50 metrics per device, collected every 5 minutes, for 5,000 devices. That gets us to our first figure of 72,000,000 documents per day, stored in a daily index that comes out at just under 12GB (the arithmetic is spelled out below the list). We are planning to retain each daily index for 90 days, which ends up being just under 1TB of data.
  2. As this is a significant amount of data, we then wanted to create rollups for hourly, daily, weekly, monthly, quarterly, and yearly averages for all metrics, ideally in manageable index sizes, so we were hoping to have a daily index for the hourly rollups, a weekly index for the daily rollups, etc. If we had to store all the hourly rollup documents for the planned six-month retention period in a single index, that index would be ~160GB in size.
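
Spelling out the raw-data arithmetic from point 1:

5,000 devices x 50 metrics x 288 samples/day (one every 5 minutes) = 72,000,000 documents/day
72,000,000 documents x ~160 bytes ≈ 11.5GB per daily index
~11.5GB/day x 90 days of retention ≈ 1TB of raw data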

Does that help clear it up?

Gotcha, that makes sense. Thanks for laying it all out... it helps me get a clearer idea of the scale we're talking about.

Some thoughts in no particular order:

  • We've been actively discussing how to do some kind of partitioning scheme; I'll post updates to that ticket as we work on it.

  • Is that 160GB size for the six-month period extrapolated from smaller Rollup tests, or just a theoretical size based on the raw data? I'm asking because the Rollup documents have a different structure than the originals, which gives them different compression characteristics (even ignoring the rollup effect, just due to data layout/format/density). I'm curious to see what kind of compression you're getting over the raw data for a specific period of time.

  • I know you said you expect the project to grow, but for now I think 160GB is right on the edge of OK. With five shards, a 160GB index comes out to ~32GB/shard, which is perfectly reasonable. Logging/metrics use cases can easily push 50-100GB/shard. But I agree that with much more volume, some kind of rollover situation is probably in order.

  • What data type is your original data? All long and float/double? Or do you also use the newer half_float and scaled_float? I'm asking because right now Rollup stores all metrics as doubles, which are quite bulky... so if we enable float/half_float/scaled_float it would compress better.

Thanks for all the info! Since Rollup is new and experimental, real user data like this is super helpful for figuring out how we should tweak/optimize/enhance it. :slight_smile:

Also, an unrelated note, but since you're actively testing Rollup I thought I should mention it: we fixed a fairly serious issue with the document IDs used by Rollup. I'd suggest upgrading to 6.4 before any production data is permanently stored :slight_smile:

Thanks for all the help Zach, to answer your questions:

  1. The 160GB was a theoretical size based on the existing raw data. I'm running a rollup job in my QA environment at the moment (on 6.4 :slight_smile:) and will let you know how it goes.

  2. Data is currently float; I hadn't seen the new data types. We could definitely go to half_float and potentially scaled_float (a rough sketch of that mapping change is below). I will chat to the team about that to save some space as well. Having those types in the rollup would be great.
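
For reference, here is roughly what that change could look like on one of our raw indices (the index name, mapping type, and scaling_factor are just placeholders; the scaling factor sets how much precision is kept):

PUT dxs-raw-2018.09.01
{
    "mappings": {
        "doc": {
            "properties": {
                "value": {
                    "type": "scaled_float",
                    "scaling_factor": 100
                }
            }
        }
    }
}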

Great, thanks! Looking forward to what your tests show :slight_smile:

OK, got it working (can I suggest a docs update that mentions Rollup doesn't play nicely with index templates? :slightly_smiling_face:).

What I end up with is a rollup index with a document size of ~210 bytes, which, using our projections, is actually a lot bigger than I was predicting, at ~220GB, so I definitely think the different data types would help with the compression.

But we are pretty happy with the results. So in the end we are looking to go with a single rollup index for each of our rollup types, split into 10 shards.

Sure... what bit was the difficulty here? Broadly matching patterns being applied to the rollup index by accident, or something similar?

What I end up with is a rollup index with a document size of ~210 bytes, which, using our projections, is actually a lot bigger than I was predicting, at ~220GB, so I definitely think the different data types would help with the compression.

But we are pretty happy with the results. So in the end we are looking to go with a single rollup index for each of our rollup types, split into 10 shards.

Sounds good, thanks for the update! I think adding support for scaled_float will help a bunch, since it is vastly smaller than double. Even float/half_float would offer significant savings. Just a guess, but I think most folks will be OK with the precision trade-off here, given it's rolled-up data.

We're also looking into what we can turn off. E.g. do we really need to index the values for search, or is it OK to just generate doc_values? Things like that should improve size over time as we implement them.
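
As a general illustration of that idea (this is just normal mapping behaviour, not a Rollup option today), a numeric field can be left unindexed while staying aggregatable through its doc_values:

PUT metrics-illustration
{
    "mappings": {
        "doc": {
            "properties": {
                "value": {
                    "type": "double",
                    "index": false,
                    "doc_values": true
                }
            }
        }
    }
}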

Hey Zach, yes, that was the exact problem with the index templates. It surfaced strangely, though, in the sense that the rollup job was created successfully but just didn't populate with data until I removed the index template and recreated the job.
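
For anyone else who hits this: our template's pattern was broad enough to also match the rollup index. Presumably scoping the template to just the raw indices, along these lines (the template name and settings here are illustrative), would avoid the clash:

PUT _template/dxs-raw
{
    "index_patterns": ["dxs-raw-*"],
    "settings": {
        "number_of_shards": 5
    }
}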

That all sounds excellent. I can definitely voice at least one vote of agreement on the trade-off between precision and size :slight_smile:.

Thanks again for all your help on this.
