Permanently Delete All Old Documents with a Rollup Job?

Hello Master Elasticsearch Gurus,

I am new to ELK and last year implemented a simple Elasticsearch server (ver 7.4.0, I know, I have to upgrade) that is archiving some simple production user data. When I set up ES, I thought it would only be a temporary thing, so I did a “bare bones” implementation. But now, ES has proven its worthiness, and my boss would like the server to collect data for a few months.

This is a problem, because my only index is growing at an unsustainable rate. Sooner or later, I’ll have too many documents and, well, I don’t want to think about what happens next:

[root@Linux elasticsearch]# curl -X GET "localhost:9200/_cat/indices?v&pretty"
health status index                    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   myIndex                  12345-A-12345abcde1234   1   1    1227840            0    284.5mb        284.5mb
[root@Linux elasticsearch]#

See? 1,227,840 documents. This index is about to blow up.

The thing is, I really don’t need those documents after, say, a week. Once a document is over seven days old, it can be permanently deleted. The index should live forever, but I need to automate a way to clean out old data.

Being an ES newbie, I’ve read through the online documentation, and I think what I want is a rollup job. (An Index Lifecycle Management policy seems like the wrong approach – I don’t want to ever phase out the index.) The catch with a rollup job is, I don’t really want to roll up any data. After X amount of time, I want to clean out the old data, never to be seen or summarized again. I don’t need to inspect the old data before it gets thrown out; I just need it gone.

Working through the Rollup Jobs section on ES’s online documentation, I think what I need to do is this:

curl -X PUT "localhost:9200/_rollup/job/myCleanUp?pretty" -H 'Content-Type: application/json' -d'
{
    "index_pattern": "myIndex",
    "rollup_index": "willNeverUse",
    "cron": "*/30 * * * * ?",
    "page_size": 1000,
    "groups": {
      "date_histogram": {
        "field": "timestamp",
        "fixed_interval": "1h",
        "delay": "7d"
      }
    }
}
'

In other words: Every thirty minutes, check through index myIndex. If you see any documents older than seven days, roll them up into a rollup index named willNeverUse. But don't actually preserve or summarize any data.

On paper, this looks correct. I don't dare try to implement it as-is, because my ES is technically in production.

But this solution is kind of silly, right? I am rolling up a lot of nonexistent data. And while myIndex will remain small and manageable, willNeverUse will continue to grow and grow with nothing but date histogram metadata. Sooner or later, that rollup_index will balloon to an unmanageable size. I’m just kicking the can down the road.

Isn’t there a more direct approach? Can’t I just configure ES to delete all documents in myIndex that are older than 7 days? Thank you.

Hello Master Elasticsearch Gurus,

Any thoughts on this? Much appreciated... :slight_smile:

Please be patient when expecting answers; this forum is run by volunteers.

A couple of remarks. First, 1.2 million documents is not really a lot, and should easily be handled by a single small node.

Second, index lifecycle management might be more what you are after. Instead of having a single index, how about having several indices that are based on date or number of documents? This way you can easily delete a whole index once its data has aged.

You may want to read about the time-based data flow, then about rollover, and lastly about index lifecycle management again, which suddenly makes more sense in the context of having more than one index.
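As a rough sketch (the policy name cleanup-7d and the thresholds are placeholders, not recommendations), an ILM policy that rolls an index over daily or at 50 GB, and deletes each backing index seven days after its rollover, could look something like this:

curl -X PUT "localhost:9200/_ilm/policy/cleanup-7d?pretty" -H 'Content-Type: application/json' -d'
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_size": "50gb"
          }
        }
      },
      "delete": {
        "min_age": "7d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
'

Note that min_age in the delete phase is measured from the moment the index rolled over, so old data ages out automatically while writes continue into the newest backing index.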

Hope this helps as a start.

Thanks Alexander,

I apologize for being pushy, I really am extremely grateful for the expertise I get here on the forums. The above was my 60th post, and I have learned so much from you gurus! Bless you for the hard work you put in. Please excuse my bad manners.

I've read a little on index lifecycle management and time-based data. I initially thought that ILM was not the way for me to go because each consecutively-created index would have a new name, would it not? The data collected by my Elasticsearch server will be used by Ops Engineers who will not want to continually repoint their automated scripts and Kibana instances to a new index. A single index with the same permanent name is best for my situation.

Can you offer some advice on the best approach to achieve this? Any wisdom will be greatly appreciated, I assure you. You guys on this forum are awesome. :slight_smile:

You need a write alias to solve that issue. See the rollover_alias setting in an ILM policy.
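As a minimal sketch (all names here are illustrative, and it assumes an ILM policy called cleanup-7d already exists): the Ops scripts and Kibana keep pointing at the alias myindex forever, while ILM rolls the backing indices over underneath it. A legacy index template (composable templates only arrived in 7.8) attaches the policy and the rollover alias to every new backing index, and the first index is bootstrapped with the write alias:

curl -X PUT "localhost:9200/_template/myindex-template?pretty" -H 'Content-Type: application/json' -d'
{
  "index_patterns": ["myindex-*"],
  "settings": {
    "index.lifecycle.name": "cleanup-7d",
    "index.lifecycle.rollover_alias": "myindex"
  }
}
'

curl -X PUT "localhost:9200/myindex-000001?pretty" -H 'Content-Type: application/json' -d'
{
  "aliases": {
    "myindex": { "is_write_index": true }
  }
}
'

From then on, clients write to and read from myindex only; rollover creates myindex-000002, myindex-000003, and so on, and moves the write flag along automatically.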

Thank you, this is the kind of information I need. Much appreciated.