# Dec 5th, 2023: [EN] Santa? It's Time To Leave! - Using TTL on Elasticsearch documents

This article is also available in French.

Imagine that Santa has to deliver presents to all the children in the world. He has a lot of work to do, so he needs to be efficient. He has a list of all the children and he knows where they live. He will most likely group the presents by area and deliver them area by area. But he won't linger anywhere: he will simply drop the presents and leave, without waiting for the children to open them.

Maybe we could suggest that he keep a list of the cities he still has to visit, removing each city from the list once its presents are delivered. This way he always knows where he still has to go, and he never wastes time going back to the same place.

To do this, he could simply attach a TTL (Time To Live) to each city he has to visit, set to the time he needs before the presents are delivered there. Once the TTL expires, the city is removed from the list.

This is what his journey could look like:

```
{ "city": "Sidney", "deliver": "ASAP", "ttl": 0 }
{ "city": "Singapore", "deliver": "1 minute", "ttl": 1 }
{ "city": "Vilnius", "deliver": "3 minutes", "ttl": 3 }
{ "city": "Paris", "deliver": "5 minutes", "ttl": 5 }
{ "city": "London", "deliver": "6 minutes", "ttl": 6 }
{ "city": "Montréal", "deliver": "7 minutes", "ttl": 7 }
{ "city": "San Francisco", "deliver": "9 minutes", "ttl": 9 }
{ "city": "North Pole", "deliver": "forever" }
```

`ttl` contains our TTL value in minutes:

- no value means that the document will be kept forever;
- zero means that we want to remove the document as soon as possible;
- any positive value corresponds to the number of minutes we need to wait before removing the document.

## The ttl ingest pipeline

To implement such a feature, you just need an ingest pipeline:

```
DELETE /_ingest/pipeline/ttl
PUT /_ingest/pipeline/ttl
{
  "processors": [
    {
      "set": {
        "field": "ingest_date",
        "value": "{{{_ingest.timestamp}}}"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
          ctx['ttl_date'] = ZonedDateTime.parse(ctx['ingest_date']).plusMinutes(ctx['ttl']);
        """,
        "if": "ctx?.ttl != null"
      }
    },
    {
      "remove": {
        "field": [ "ingest_date", "ttl" ],
        "ignore_missing": true
      }
    }
  ]
}
```

Let's explain this a bit.

The first processor sets a temporary field (`ingest_date`) in the document, injecting the time when the pipeline is executed (`_ingest.timestamp`), which is more or less the indexing date.

Then we run a Painless script:

```
ctx['ttl_date'] = ZonedDateTime.parse(ctx['ingest_date']).plusMinutes(ctx['ttl']);
```

This script creates a `ZonedDateTime` Java object from the String value available in the `ingest_date` field. Then we call the `plusMinutes` method with the value of `ttl` as the parameter, which shifts the ingest date by that many minutes. We store the result in a new `ttl_date` field.
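For reference, the same date shift can be sketched outside of Elasticsearch. This hypothetical Python equivalent (the function name is illustrative, it is not part of the pipeline) parses an ISO-8601 ingest date and shifts it by `ttl` minutes, just like the Painless one-liner does:

```python
from datetime import datetime, timedelta

# Hypothetical Python equivalent of the Painless script: parse the
# ISO-8601 ingest date and shift it by `ttl` minutes to get the ttl_date.
def compute_ttl_date(ingest_date: str, ttl_minutes: int) -> str:
    parsed = datetime.fromisoformat(ingest_date.replace("Z", "+00:00"))
    shifted = parsed + timedelta(minutes=ttl_minutes)
    return shifted.isoformat().replace("+00:00", "Z")

print(compute_ttl_date("2023-11-23T11:14:42Z", 5))  # 2023-11-23T11:19:42Z
```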

Note that we need to add a condition to run this processor only if a `ttl` field exists:

```
"if": "ctx?.ttl != null"
```

Then we remove the fields we no longer need: `ingest_date`, and optionally `ttl` if we don't need it anymore (for debugging purposes, it might be smart to keep `ttl` around). Since a document may have no `ttl` at all, we also need to ignore the field if it's missing. This is done with the `"ignore_missing": true` parameter.

To test this pipeline, we can use the simulate API:

```
POST /_ingest/pipeline/ttl/_simulate?filter_path=docs.doc._source,docs.doc._ingest
{
  "docs": [
    {
      "_source": { "city": "Sidney", "deliver": "ASAP", "ttl": 0 }
    },
    {
      "_source": { "city": "Singapore", "deliver": "1 minute", "ttl": 1 }
    },
    {
      "_source": { "city": "Paris", "deliver": "5 minutes", "ttl": 5 }
    },
    {
      "_source": { "city": "North Pole", "deliver": "forever" }
    }
  ]
}
```

This gives:

```
{
  "docs": [
    {
      "doc": {
        "_source": {
          "deliver": "ASAP",
          "ttl_date": "2023-11-23T11:14:42.723353333Z",
          "city": "Sidney"
        },
        "_ingest": {
          "timestamp": "2023-11-23T11:14:42.723353333Z"
        }
      }
    },
    {
      "doc": {
        "_source": {
          "deliver": "1 minute",
          "ttl_date": "2023-11-23T11:15:42.723413177Z",
          "city": "Singapore"
        },
        "_ingest": {
          "timestamp": "2023-11-23T11:14:42.723413177Z"
        }
      }
    },
    {
      "doc": {
        "_source": {
          "deliver": "5 minutes",
          "ttl_date": "2023-11-23T11:19:42.723419835Z",
          "city": "Paris"
        },
        "_ingest": {
          "timestamp": "2023-11-23T11:14:42.723419835Z"
        }
      }
    },
    {
      "doc": {
        "_source": {
          "city": "North Pole",
          "deliver": "forever"
        },
        "_ingest": {
          "timestamp": "2023-11-23T11:14:42.723423778Z"
        }
      }
    }
  ]
}
```

We can see the shifted dates for document removal.

## Automatically create the `ttl_date` field

We can use the `final_pipeline` index setting to define the `ttl` pipeline as the one to use just before the actual index operation.

```
DELETE /ttl-demo
PUT /ttl-demo
{
  "settings": {
    "final_pipeline": "ttl"
  },
  "mappings": {
    "_source": {
      "excludes": [
        "ttl_date"
      ]
    },
    "properties": {
      "ttl_date": {
        "type": "date"
      }
    }
  }
}
```

You could also use the `default_pipeline` index setting instead, but be aware that the `ttl` pipeline would then not be called when a user indexes a document with their own pipeline, like:

```
POST /ttl-demo/_doc?pipeline=my-pipeline
{
  "city": "Singapore",
  "deliver": "1 minute",
  "ttl": 1
}
```

Note also that, in the mapping above, we exclude the `ttl_date` field from the `_source`: it's just a "technical" field, so we don't want to store it there.

## Index documents

We can now inject our dataset:

```
POST /ttl-demo/_bulk
{ "index": {} }
{ "city": "Sidney", "deliver": "ASAP", "ttl": 0 }
{ "index": {} }
{ "city": "Singapore", "deliver": "1 minute", "ttl": 1 }
{ "index": {} }
{ "city": "Vilnius", "deliver": "3 minutes", "ttl": 3 }
{ "index": {} }
{ "city": "Paris", "deliver": "5 minutes", "ttl": 5 }
{ "index": {} }
{ "city": "London", "deliver": "6 minutes", "ttl": 6 }
{ "index": {} }
{ "city": "Montréal", "deliver": "7 minutes", "ttl": 7 }
{ "index": {} }
{ "city": "San Francisco", "deliver": "9 minutes", "ttl": 9 }
{ "index": {} }
{ "city": "North Pole", "deliver": "forever" }
```

## Remove TTL'ed documents

It's now easy to run a Delete By Query call:

```
POST /ttl-demo/_delete_by_query
{
  "query": {
    "range": {
      "ttl_date": {
        "lte": "now"
      }
    }
  }
}
```

We simply want to delete all the documents whose `ttl_date` is at or before `now`, the time when the request is executed.
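The semantics of this range query can be sketched in plain Python (the names here are illustrative, not an Elasticsearch API): a document is expired when its `ttl_date` exists and is at or before the current time, and documents without a `ttl_date` are never matched:

```python
from datetime import datetime, timedelta, timezone

# Illustrative sketch of the delete-by-query predicate: a document is
# expired when its ttl_date exists and is at or before "now".
def is_expired(ttl_date, now):
    return ttl_date is not None and ttl_date <= now

now = datetime(2023, 11, 23, 11, 15, tzinfo=timezone.utc)
docs = [
    {"city": "Sidney", "ttl_date": now - timedelta(minutes=1)},
    {"city": "Singapore", "ttl_date": now + timedelta(minutes=1)},
    {"city": "North Pole", "ttl_date": None},  # no ttl_date: kept forever
]
print([d["city"] for d in docs if is_expired(d["ttl_date"], now)])  # ['Sidney']
```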

If we run it immediately, we can see that only the document `{ "city": "Sidney", "deliver": "ASAP", "ttl": 0 }` is removed.
After one minute, `{ "city": "Singapore", "deliver": "1 minute", "ttl": 1 }` goes away as well. And after a few more minutes, only `{ "city": "North Pole", "deliver": "forever" }` remains. It will be kept forever.

## Using Watcher to run it every minute

You can use a `crontab` to run such a query every minute:

```
* * * * * curl -XPOST -u elastic:changeme https://127.0.0.1:9200/ttl-demo/_delete_by_query -H 'Content-Type: application/json' -d '{"query":{"range":{"ttl_date":{"lte":"now"}}}}'
```

Note that you will have to monitor this job yourself. If you have a commercial license, you can instead run it directly from Elasticsearch using Watcher:

```
PUT _watcher/watch/ttl
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "simple": {}
  },
  "condition": {
    "always": {}
  },
  "actions": {
    "call_dbq": {
      "webhook": {
        "url": "https://127.0.0.1:9200/ttl-demo/_delete_by_query",
        "method": "post",
        "body": "{\"query\":{\"range\":{\"ttl_date\":{\"lte\":\"now\"}}}}",
        "auth": {
          "basic": {
          }
        }
      }
    }
  }
}
```

Note that we use the `interval` parameter to trigger this action every minute, and a webhook action to call the Delete By Query API. Note also that we need to provide authentication information.

If you are not running on cloud.elastic.co but locally with a self-signed certificate (Elasticsearch is secured by default), you need to set `xpack.http.ssl.verification_mode` to `none`; otherwise Elasticsearch will not accept the self-signed certificate. Of course, this is just for testing purposes. Don't do that in production!

After at most one minute, Santa will see that the cities he has already visited have been removed from the list:

```
GET /ttl-demo/_search
```
```
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "ttl-demo",
        "_id": "IBTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "deliver": "3 minutes",
          "city": "Vilnius"
        }
      },
      {
        "_index": "ttl-demo",
        "_id": "IRTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "deliver": "5 minutes",
          "city": "Paris"
        }
      },
      {
        "_index": "ttl-demo",
        "_id": "IhTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "deliver": "6 minutes",
          "city": "London"
        }
      },
      {
        "_index": "ttl-demo",
        "_id": "IxTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "deliver": "7 minutes",
          "city": "Montréal"
        }
      },
      {
        "_index": "ttl-demo",
        "_id": "JBTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "deliver": "9 minutes",
          "city": "San Francisco"
        }
      },
      {
        "_index": "ttl-demo",
        "_id": "JRTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "city": "North Pole",
          "deliver": "forever"
        }
      }
    ]
  }
}
```

And at the end, only his home remains:

```
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "ttl-demo",
        "_id": "JRTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "city": "North Pole",
          "deliver": "forever"
        }
      }
    ]
  }
}
```

This is a quick-and-dirty solution to remove old data from your Elasticsearch cluster. But note that you should never apply this technique to logs or any time-based indices, or when the quantity of data to be removed this way is more than, let's say, 10% of the dataset.

Instead, you should prefer the Delete Index API to drop a full index at once rather than removing a full set of documents. That's way more efficient.

## Drop the index (preferred)

To do this, we can change the pipeline to send the data to an index whose name contains the TTL date:

```
PUT /_ingest/pipeline/ttl
{
  "processors": [
    {
      "set": {
        "field": "ingest_date",
        "value": "{{{_ingest.timestamp}}}"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
          ctx['ttl_date'] = ZonedDateTime.parse(ctx['ingest_date']).plusDays(ctx['ttl']);
        """,
        "ignore_failure": true
      }
    },
    {
      "date_index_name": {
        "field": "ttl_date",
        "index_name_prefix": "ttl-demo-",
        "date_rounding": "d",
        "date_formats": ["yyyy-MM-dd'T'HH:mm:ss.nz"],
        "index_name_format": "yyyy-MM-dd",
        "if": "ctx?.ttl_date != null"
      }
    },
    {
      "set": {
        "field": "_index",
        "value": "ttl-demo-forever",
        "if": "ctx?.ttl_date == null"
      }
    },
    {
      "remove": {
        "field": [ "ingest_date", "ttl", "ttl_date" ],
        "ignore_missing": true
      }
    }
  ]
}
```

In this example, I switched to daily indices (note the `plusDays` call), because that is much closer to what you would see in production: you normally don't expire data after a few minutes, but after days or months.

```
ctx['ttl_date'] = ZonedDateTime.parse(ctx['ingest_date']).plusDays(ctx['ttl']);
```

If the `ttl_date` field exists, we use the `date_index_name` processor to build a new index name based on it. We use the `date_rounding` parameter to round the date to the day, and the `index_name_format` parameter to format it as `yyyy-MM-dd`. This produces index names like `ttl-demo-2023-11-27`:

```
{
  "date_index_name": {
    "field": "ttl_date",
    "index_name_prefix": "ttl-demo-",
    "date_rounding": "d",
    "date_formats": ["yyyy-MM-dd'T'HH:mm:ss.nz"],
    "index_name_format": "yyyy-MM-dd",
    "if": "ctx?.ttl_date != null"
  }
}
```

If the `ttl_date` field does not exist, we just set the index name to `ttl-demo-forever`:

```
{
  "set": {
    "field": "_index",
    "value": "ttl-demo-forever",
    "if": "ctx?.ttl_date == null"
  }
}
```

We can reindex our dataset:

```
POST /ttl-demo/_bulk?pipeline=ttl
{ "index": {} }
{ "city": "Sidney", "deliver": "ASAP", "ttl": 0 }
{ "index": {} }
{ "city": "Singapore", "deliver": "1 day", "ttl": 1 }
{ "index": {} }
{ "city": "Vilnius", "deliver": "3 days", "ttl": 3 }
{ "index": {} }
{ "city": "Paris", "deliver": "5 days", "ttl": 5 }
{ "index": {} }
{ "city": "North Pole", "deliver": "forever" }
```

And we can see that the documents are now in different indices:

```
GET /ttl-demo-*/_search?filter_path=hits.hits._index,hits.hits._source.deliver
```

Gives:

```
{
  "hits": {
    "hits": [
      {
        "_index": "ttl-demo-2023-11-27",
        "_source": {
          "deliver": "ASAP"
        }
      },
      {
        "_index": "ttl-demo-2023-11-28",
        "_source": {
          "deliver": "1 day"
        }
      },
      {
        "_index": "ttl-demo-2023-11-30",
        "_source": {
          "deliver": "3 days"
        }
      },
      {
        "_index": "ttl-demo-2023-12-02",
        "_source": {
          "deliver": "5 days"
        }
      },
      {
        "_index": "ttl-demo-forever",
        "_source": {
          "deliver": "forever"
        }
      }
    ]
  }
}
```

The index name no longer refers, as we are used to, to the date of the data, but to the date when the data should be removed. So we can run a daily crontab to drop the expired indices. The following script is intended to run on macOS (it uses the BSD `date -v` option):

```
0 0 * * * curl -XDELETE -u elastic:changeme https://127.0.0.1:9200/ttl-demo-$(date -v -1d -j +%F)
```

## Wrapping up

We saw two ways of implementing a TTL on Elasticsearch documents. The first is to use a TTL field and remove expired documents with a Delete By Query call. The second (much more efficient when a lot of data has to be removed) is to use the TTL field to route documents to different indices, and then drop whole indices with a crontab.

But with both solutions, expired documents remain visible until the crontab runs.

You could think of using an index filtered alias to hide the old documents:

```
POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "ttl-demo",
        "alias": "ttl-filtered",
        "filter": {
          "bool": {
            "filter": [
              {
                "range": {
                  "ttl_date": {
                    "gt": "now/m"
                  }
                }
              }
            ]
          }
        }
      }
    }
  ]
}
```

Searching through the `ttl-filtered` alias will only return documents that have not expired yet, even before the batch process (crontab or Watcher) has actually removed the expired ones.

Santa can now know where to go next safely and then enjoy a well deserved year of rest!
