Best approach to implement ILM on a large index and archive old data

Baygon · March 20, 2023, 7:09am

Hi,

We have a single-node cluster where one index unfortunately grew very large (261 GB) because we had no ILM policy on it. This is a production cluster.
We understand that performance degrades above 50 GB, and I think we are now starting to feel it.
The logical step would be to implement an ILM rollover.
The disk on that Google Cloud Compute machine is a 562 GB disk with 461 GB used (102 GB free, i.e. 82% utilization).
We do not want to expand this disk, as it is too troublesome to shrink it back later on.
We have about 2 years of data in this index, but realistically only need 1 year. There may be some use cases where we want to restore the older data, though.

What would be the best way to proceed to minimize downtime on this node, archive the data older than 1 year, and implement rollover on this index to keep each shard around 25 GB?

I understand we can use Curator for that purpose. Which action should we use?

Thanks for your help

theuntergeek · March 20, 2023, 3:23pm

Help me to understand correctly.

Is the data all in a single index? If so, neither ILM nor Curator can help you with that. Both manage data at the index level. The only way to purge documents older than a given date from within an index is a delete_by_query operation. This will be painful and slow by comparison, but it will eventually get done. The question is whether the disk I/O from the delete_by_query operation affects performance to a degree that impedes your regular operations.
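
For reference, a minimal sketch of such a delete_by_query request. It assumes the index is called my-index-obj and that the timestamp field is called date (an assumption; substitute the field from your own mapping). The throttle value of 500 is arbitrary and only there to show how the I/O impact could be limited.

```console
# Runs in the background and returns a task ID instead of blocking.
# conflicts=proceed keeps going if a document changes while the delete is running.
POST my-index-obj/_delete_by_query?wait_for_completion=false&conflicts=proceed&requests_per_second=500
{
  "query": {
    "range": {
      "date": {
        "lt": "now-1y/d"
      }
    }
  }
}
```

Note that deleted documents are only marked as deleted at first. Disk space is reclaimed as segments merge, so free space will not go up immediately.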

Wave · March 20, 2023, 3:32pm

What about creating a new ILM-controlled index and sending current (live) data there? At least with that done, your large index won't continue to grow.

Then you could reindex with a query to pull the data out say one month at a time and then send it to the new ILM controlled index. It might not be super fast either though.

Baygon · March 21, 2023, 3:03am

Yes, everything is in a single index. I/O performance is not too much of a concern, as there is plenty of idle time on this machine. Availability is the concern, not latency.

Baygon · March 21, 2023, 3:15am

Thanks, that is a great suggestion!

Can you help me validate the process I see:

Assuming my historical index is called my-index-obj

1. Create the new index:
PUT /my-index-000001
2. Create an ILM policy with rollover:

PUT /_ilm/policy/my-index-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "25GB"
          }
        }
      }
    }
  }
}

3. Create an index template in Kibana
4. Apply the policy to the new index:

PUT /_template/my-index-template
{
  "index_patterns": ["my-index-*"],
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1,
    "index.lifecycle.name": "my-index-policy",
    "index.lifecycle.rollover_alias": "my-index"
  }
}

5. Set the new index as the write index:

PUT my-index-000001
{
  "aliases": {
    "my-index": {
      "is_write_index": true
    }
  }
}

By doing that I implement the rollover on the new index, and searches would now need to target my-index instead of my-index-obj, correct?

How would I go about the reindexing part?

Wave · March 21, 2023, 4:43pm

Hi @Baygon,

Those steps look pretty good to me, but I'd switch the order to: 3,4,1,2,5. You want the template to be applied when you create the new index.
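
One way to confirm that the template and policy were picked up once the index exists is sketched below. It assumes the names from your steps (my-index-000001 and the my-index alias).

```console
# Shows the lifecycle policy and current phase for the new index,
# plus any error ILM hit while managing it.
GET my-index-000001/_ilm/explain

# Shows which indices the alias points to and which one is the write index.
GET _alias/my-index
```

If the explain output reports the index as not managed, the template most likely was not in place when the index was created.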

Also, after thinking about it some more, I wouldn't recommend reindexing your historical index into the new one. As it reindexes and rolls over, your current data will be spread out across all the resulting indices. You mention that you only need to keep one year's worth of data, but if the new data is spread out you'll need to keep all those indices for an extra year, since they will all contain recent data, if that makes sense.

The old-school way to handle data in Elasticsearch was to have the date in the index name, and I think that applies here, since these will start aging out of the one-year window starting next month.

For example, reindexing last year's March data into a new index could be done with:

POST _reindex
{
  "source": {
    "index": "my-index-obj",
    "query": {
      "range": {
        "date": {
          "gte": "2022-03-01T00:00:00.000",
          "lt": "2022-04-01T00:00:00.000"
        }
      }
    }
  },
  "dest": {
    "index": "my-index-2022.03"
  }
}

Actually, after thinking about it some more, I'd recommend reindexing the data up to the current month and then doing the steps you mentioned above. Then, once live data is hitting the new index, reindex only the current month (or whatever timeframe you decide on) into that index. That way you won't have your live data spread out across multiple indices, and you also won't have your reindexed indices under a lifecycle policy.
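
As a rough sketch of that last step, the current month could be reindexed into the rollover alias in the background. This assumes the current month is March 2023, that the timestamp field is called date, and that my-index is the write alias from your steps. The throttle value is arbitrary.

```console
# Start the reindex as a background task; the response contains a task ID.
POST _reindex?wait_for_completion=false&requests_per_second=1000
{
  "source": {
    "index": "my-index-obj",
    "query": {
      "range": {
        "date": {
          "gte": "2023-03-01T00:00:00.000"
        }
      }
    }
  },
  "dest": {
    "index": "my-index"
  }
}

# Check progress; replace <task_id> with the value returned above.
GET _tasks/<task_id>
```

Writing to the alias sends the documents to whichever index is currently the write index, so this only makes sense once nothing is writing to my-index-obj anymore.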

Play around with it to get comfortable with what is happening. You can set the date range to be quite small to get a good feel for what is going on. I hope that all makes sense, and good luck.
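
A simple sanity check while experimenting is to compare document counts between the source range and the destination index. The sketch below reuses the March 2022 example and again assumes the timestamp field is called date.

```console
# Number of documents in the source index for the chosen date range
GET my-index-obj/_count
{
  "query": {
    "range": {
      "date": {
        "gte": "2022-03-01T00:00:00.000",
        "lt": "2022-04-01T00:00:00.000"
      }
    }
  }
}

# Number of documents that landed in the destination index
GET my-index-2022.03/_count
```

The two counts should match once the reindex has finished and the destination index has refreshed.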

Of course, I have to add the reminder to make sure you have taken a snapshot of your data first. Once you are satisfied that your year-old data is in the new reindexed indices, you can delete my-index-obj.
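
A snapshot of just the big index might look like the sketch below. It assumes a snapshot repository has already been registered; my_backup_repo and the snapshot name are made-up placeholders.

```console
# Snapshot only the historical index, without cluster-wide state.
PUT _snapshot/my_backup_repo/my-index-obj-before-cleanup?wait_for_completion=false
{
  "indices": "my-index-obj",
  "include_global_state": false
}

# Check that the snapshot completed successfully before deleting anything.
GET _snapshot/my_backup_repo/my-index-obj-before-cleanup/_status
```

Since you may occasionally need the older data, this snapshot can also serve as the archive to restore from later.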
