ILM handling out-of-order ingest

Background

  1. Our current production cluster (6.5.x) has time series indices with an hourly naming convention, indexname-YYYYMMDDHH. We see variance in index sizes because our time series data has seasonality. This causes sub-optimal disk usage in the cluster, because shard balancing is done by shard count, not size.
  2. The current cluster has Curator set up to perform several actions similar to ILM's (delete, force merge)

Goals

  1. Rollover by Size: We're planning an upgrade to 6.8.x to enable ILM, so that we can roll over by index size
  2. Clear backlog in LIFO order: When there's a backlog of documents from an ES outage or an ingest issue (ES-Hadoop in our case), we want the latest data ingested first, so that the latest insights reach the cluster first.
  3. Age off/Delete indices: indices that are older than x days are deleted
  4. Set up hot-warm-cold architecture: to decrease search response times for the latest data. We're planning to set up "Warm" nodes with a higher storage-to-RAM ratio. (A policy sketch covering goals 1, 3, and 4 follows this list.)
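
For reference, here is a minimal 6.8-style policy sketch covering goals 1, 3, and 4. This is only an illustration: the policy name, the node attribute (data: warm/cold), and the size/age thresholds are placeholders, not settled values.

    PUT _ilm/policy/indexname-policy
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              // goal 1: roll over by size instead of by hour
              "rollover": { "max_size": "50gb" }
            }
          },
          "warm": {
            "min_age": "1d",
            "actions": {
              // goal 4: move shards to the denser "warm" tier and force merge
              "allocate": { "require": { "data": "warm" } },
              "forcemerge": { "max_num_segments": 1 }
            }
          },
          "cold": {
            "min_age": "7d",
            "actions": {
              "allocate": { "require": { "data": "cold" } }
            }
          },
          "delete": {
            // goal 3: age off after x days (counted from rollover, see below)
            "min_age": "30d",
            "actions": { "delete": {} }
          }
        }
      }
    }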

Questions

Consider the following operational scenario. Ingest was stopped for 8 hours (starting 2019-10-21 3am). When ingest is re-enabled (2019-10-21 11am), ES-Hadoop clears the backlog in LIFO order.

  • It will roll over multiple times as it catches up: 2019-10-21 10-11am data (roughly, by size) goes to indexname-00001, 2019-10-21 9-10am data to indexname-00002, ..., and 2019-10-21 3-4am data to indexname-00008. Note that the rollover timestamp of indexname-00008 is later than that of indexname-00001, although the event timestamps in indexname-00001 are later than those in indexname-00008.
  • This can cause confusion in ILM phases that use the rollover timestamp (such as the transitions into the cold and delete phases).
    • For example, the delete phase will delete newer data first (indexname-00001 goes first, although its data is newer than indexname-00008's).
  • I see that 7.5 has a new setting called origination date that can help with this problem.
    • Note the limitation: if we're parsing the origination date from the index name using the index.lifecycle.parse_origination_date option, the resolution of that timestamp is days (not hours).
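
For context, here is a sketch of what that could look like on 7.5, assuming the documented yyyy.MM.dd name format (which is exactly why the resolution is days); the index and policy names are hypothetical:

    PUT indexname-2019.10.21-000001
    {
      "settings": {
        "index.lifecycle.name": "indexname-policy",
        // parse the origination date from the yyyy.MM.dd part of the index name
        "index.lifecycle.parse_origination_date": true
      }
    }

If day resolution from the name is not enough, index.lifecycle.origination_date can instead be set explicitly as an epoch-milliseconds value.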

Since we're not upgrading to 7 yet, the only workaround I see is the following:

  • Set up a daily ILM policy on index patterns like indexname-YYYYMMDD-0000x (see the sketch after this list)
  • Purge ILM policies that are older than X days (the number of days I set on the delete phase)
  • Adjust the timing of the delete and cold phases to reflect the "origination date". If the documents got ingested 1 day late due to an ingest issue, set those timings to x-1 days.
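
A minimal sketch of the daily setup, with all names illustrative: bootstrap each day's first index with a write alias and the day's policy so rollover has something to increment, then purge the day's policy once it ages out.

    // bootstrap the day's first index; ingest then targets the alias indexname-20191021
    PUT indexname-20191021-000001
    {
      "settings": {
        "index.lifecycle.name": "indexname-20191021-policy",
        "index.lifecycle.rollover_alias": "indexname-20191021"
      },
      "aliases": {
        "indexname-20191021": { "is_write_index": true }
      }
    }

    // X days later, drop the day's policy along with its indices
    DELETE _ilm/policy/indexname-20190921-policy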

I can meet most of my goals:

  1. Rollover by Size: the daily ILM policy will let us set up hot-to-warm rollover by size.
  2. Clear backlog in LIFO order: this is something I can set up on the ingest side (unrelated to ILM).
  3. Age off/Delete indices: timing for the cold and delete phases is still calculated from the rollover timestamp, so it can still be out of order relative to the event timestamp. However, we compensate for this by adjusting the timing of those phases, as described above (see the sketch after this list).
  4. Hot-warm-cold architecture: the daily ILM policy will move indices through the phases accordingly.
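
To make the compensation concrete, here is a sketch that assumes cold normally starts at 7 days, retention (x) is 30 days, and the data arrived one day late; the policy name is hypothetical:

    PUT _ilm/policy/indexname-20191021-policy
    {
      "policy": {
        "phases": {
          "cold": {
            // 7d minus the 1-day ingest delay
            "min_age": "6d",
            "actions": { "allocate": { "require": { "data": "cold" } } }
          },
          "delete": {
            // x-1: 30d retention minus the 1-day ingest delay
            "min_age": "29d",
            "actions": { "delete": {} }
          }
        }
      }
    }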

Thinking through this, I'm convinced that for the LIFO requirement it's better to just wait until we upgrade to 7.5.x. Any thoughts?

While you may be able to shoehorn your scenario into ILM with the approach you outlined, it seems likely to bring unexpected results.

I do not see a way to cleanly combine LIFO with ILM at this time (even with the as-yet-unreleased origination_date). So long as you are ingesting with a LIFO approach, you would be far better suited to continue using datestamp-named indices, with the contents of each index restricted to the indicated age, the same way you did before. Curator would be the preferred way to address retention with datestamp-named indices today, though you might be able to make non-rollover indices work in ILM in the future with origination_date.

The reason I say this is that the delete phase is determined by the number of hours/days since rollover, not since index creation. Even with the datestamp in the index name and using the as-yet-unreleased origination_date, I see no way for you to guarantee your data will completely fit within the timestamp you affix to the indices. Writing to aliases, as rollover requires, prevents you from being able to guarantee that the index content will match the index name's date.
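
One way to observe this is the ILM explain API, which reports the lifecycle date each index is being aged against (index name taken from the scenario above):

    GET indexname-000001/_ilm/explain

Once an index has rolled over, the lifecycle date it reports is the rollover time, not anything derived from the event timestamps inside the documents.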


Thanks for your response!

Even with the datestamp in the index name and using the as-yet-unreleased origination_date, I see no way for you to guarantee your data will completely fit within the timestamp you affix to the indices.

Since I can control the datestamp at ingest using the event timestamp (via ES-Hadoop), I think I'll be able to guarantee that data completely fits within a day. Yes, I recognize that it's not at hour resolution. As I stated:

Note the limitation: if we're parsing the origination date from the index name using the index.lifecycle.parse_origination_date option, the resolution of that timestamp is days (not hours).

However, since I'm already running Curator daily to purge hourly indices, I think this is acceptable.

Again, the problem I'm trying to solve here, while considering all the edge cases, is optimizing shard balancing. The biggest limitation with hourly indices is seasonality in the dataset and the resulting variance in shard size, which causes sub-optimal disk usage in the cluster.

This will only be guaranteed if you force alias/rollover manually. Otherwise I fear that data inflow will go to whichever index the alias points to, regardless of the age of the data being sent.

You're right that ingest will go to whichever index the alias points to. I was trying to describe "force alias/rollover manually" when I said:

Set up a daily ILM policy on index patterns like indexname-YYYYMMDD-0000x

which basically means that there's a daily alias.
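
For completeness, forcing the rollover manually against the day's alias would look something like this (alias name and size threshold carried over from the sketch above):

    POST indexname-20191021/_rollover
    {
      "conditions": { "max_size": "50gb" }
    }

Calling it with no conditions rolls the alias over unconditionally.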

On the other hand, what's the best practice for handling shard-size variance at ingest time while meeting these requirements?

You can't. It's no different from daily or hourly indices. That's why I suggested sticking with standard, non-aliased daily/hourly indices.

Your best bet to reduce shard count in such cases might be to reindex into separate, consolidated indices, a many-to-one approach.
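
For illustration, a many-to-one consolidation of one day's hourly indices could look like this (the source pattern and destination name are hypothetical, and the destination deliberately does not match the source wildcard):

    // merge one day's hourly indices into a single daily index
    POST _reindex
    {
      "source": { "index": "indexname-20191021*" },
      "dest": { "index": "indexname-daily-20191021" }
    }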
