Background
- The current production cluster (6.5.x) has time series indices with an hourly naming convention, indexname-YYYYMMDDHH. Index sizes vary because our time series data has seasonality. This leads to sub-optimal disk usage, because shard balancing is done by shard count, not size.
- The current cluster has Curator set up, which performs several actions similar to ILM (delete, force merge).
Goals
- Rollover by Size: We're planning to upgrade to 6.8.x to enable ILM, so that we can roll over by index size.
- Clear backlog in LIFO order: When there's a backlog of documents from an ES outage or an ingest issue (ES-Hadoop in our case), we want the latest data to be ingested first, so that the latest insight reaches the cluster first.
- Age off/Delete indices: Indices that are older than x days are deleted.
- Set up hot-warm-cold architecture: To decrease search response time on the latest data. We're planning to set up "Warm" nodes with lower Storage-to-RAM ratio.
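To make the goals concrete, here is a minimal sketch of an ILM policy body covering rollover by size, the warm and cold transitions, and age-off. It only builds the JSON that would be sent to PUT _ilm/policy/<name>. The sizes, ages, and the box_type node attribute are placeholder assumptions, not values from our cluster.

```python
import json


def build_ilm_policy(rollover_size="50gb", warm_after="1d",
                     cold_after="7d", delete_after="30d"):
    """Build an ILM policy body. All default values are placeholders."""
    return {
        "policy": {
            "phases": {
                # Hot: roll over once the index reaches the size threshold.
                "hot": {
                    "actions": {"rollover": {"max_size": rollover_size}}
                },
                # Warm: move to warm nodes and force merge.
                # "box_type" is an assumed custom node attribute.
                "warm": {
                    "min_age": warm_after,
                    "actions": {
                        "allocate": {"require": {"box_type": "warm"}},
                        "forcemerge": {"max_num_segments": 1},
                    },
                },
                # Cold: move to cold nodes.
                "cold": {
                    "min_age": cold_after,
                    "actions": {
                        "allocate": {"require": {"box_type": "cold"}}
                    },
                },
                # Delete: age off the index.
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


policy = build_ilm_policy()
print(json.dumps(policy, indent=2))
```

Note that min_age in the warm, cold, and delete phases is measured from the rollover time, which is what the questions below are about.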
Questions
Consider the following operational scenario. Ingest was stopped for 8 hours (starting 2019-10-21 3am). When ingest is re-enabled (2019-10-21 11am), ES-Hadoop clears the backlog in LIFO order.
- The index will roll over multiple times as ingest catches up: 2019-10-21 10-11am data (roughly, by size) goes to indexname-00001, 2019-10-21 9-10am data to indexname-00002, ..., and 2019-10-21 3-4am data to indexname-00008. Note that the rollover timestamp of indexname-00001 is earlier than that of indexname-00008, although the event timestamps in indexname-00001 are later than those in indexname-00008.
- This can confuse ILM phases that are timed from the rollover timestamp (such as the transition into the cold or delete phase).
- For example, the delete phase will delete newer data first (indexname-00001 goes first, although its data is newer than indexname-00008's).
- I see that 7.5 has a new setting called origination date that can help with this problem.
- Note the limitation: if we parse the origination date from the index name using the index.lifecycle.parse_origination_date option, the resolution of that timestamp is days, not hours.
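The ordering problem in this scenario can be reproduced with a small simulation. This is only a sketch: it assumes each backlog hour fills exactly one index and that each index takes a fixed, made-up number of minutes to fill after ingest resumes.

```python
from datetime import datetime, timedelta


def simulate_lifo_catchup(outage_start, hours, resume_at, minutes_per_index=10):
    """Return (index_name, event_hour, rollover_time) tuples in ingest order.

    Assumes one backlog hour per index and a fixed fill time per index.
    """
    indices = []
    for n in range(hours):
        # LIFO: the newest backlog hour is ingested first.
        event_hour = outage_start + timedelta(hours=hours - 1 - n)
        rollover_time = resume_at + timedelta(minutes=minutes_per_index * (n + 1))
        indices.append((f"indexname-{n + 1:05d}", event_hour, rollover_time))
    return indices


indices = simulate_lifo_catchup(
    outage_start=datetime(2019, 10, 21, 3),
    hours=8,
    resume_at=datetime(2019, 10, 21, 11),
)

# Order in which phases timed from rollover would act on the indices.
by_rollover = [name for name, _, rolled in sorted(indices, key=lambda i: i[2])]
# Order we actually want: oldest event data first.
by_event = [name for name, event, _ in sorted(indices, key=lambda i: i[1])]

print("by rollover time:", by_rollover)
print("by event time:   ", by_event)
```

Sorting by rollover time puts indexname-00001 first, while sorting by event time puts indexname-00008 first. Timing phases from an origination date instead of the rollover time would follow the second ordering, subject to the day-level resolution noted above.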
Since we're not upgrading to 7 yet, the only workaround I see is the following:
- Set up an ILM policy daily for the following index pattern: indexname-YYYYMMDD-0000x
- Purge ILM policies that are older than X days (the number of days I set on the delete phase)
- Adjust the timing of the delete and cold phases to reflect the "origination date". If the documents were ingested 1 day late due to an ingest issue, set those timings to x-1 days.
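The bookkeeping for this workaround could look roughly like the sketch below. The policy naming scheme and both helper functions are hypothetical. The code only computes names and adjusted ages and does not call Elasticsearch.

```python
from datetime import date, timedelta


def daily_policy_plan(event_day, ingest_day, delete_after_days, cold_after_days):
    """Plan one day's policy, shortening phase ages by the ingest delay."""
    days_late = max((ingest_day - event_day).days, 0)
    return {
        # Hypothetical naming scheme for the per-day policy.
        "policy_name": f"indexname-{event_day:%Y%m%d}-policy",
        "index_pattern": f"indexname-{event_day:%Y%m%d}-*",
        # x days becomes x - days_late days, never below zero.
        "cold_min_age": f"{max(cold_after_days - days_late, 0)}d",
        "delete_min_age": f"{max(delete_after_days - days_late, 0)}d",
    }


def policies_to_purge(policy_days, today, delete_after_days):
    """Return the policy days that are older than the delete threshold."""
    cutoff = today - timedelta(days=delete_after_days)
    return [d for d in policy_days if d < cutoff]


# Data for 2019-10-20 that was ingested one day late.
plan = daily_policy_plan(date(2019, 10, 20), date(2019, 10, 21),
                         delete_after_days=30, cold_after_days=7)
print(plan)

stale = policies_to_purge([date(2019, 9, 1), date(2019, 10, 20)],
                          today=date(2019, 10, 21), delete_after_days=30)
print(stale)
```

The adjustment only works at day granularity, so hourly out-of-order rollover within a single day is still not corrected.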
I can meet most of my goals:
- Rollover by Size: daily ILM policy will help us to set up hot-to-warm rollover by size.
- Clear backlog in LIFO order: this is something I can set up on ingest side (unrelated to ILM)
- Age off/Delete indices: Timing for the cold and delete phases is still calculated from the rollover timestamp, so it can still be out of order relative to the event timestamp. However, we're compensating for this by adjusting the timing of these phases, as described above.
- Hot-warm-cold architecture: daily ILM policy will phase indices accordingly.
Thinking this through, I'm convinced that for the LIFO requirement it's better to just wait until the upgrade to 7.5.x. Any thoughts?