Dec 5th, 2023: [EN] Santa? It's Time To Leave! - Using TTL on Elasticsearch documents

This article is also available in French.

Imagine that Santa has to deliver presents to all the children in the world. He has a lot of work to do and he needs to be efficient. He has a list of all the children and he knows where they live. He will most likely group the presents by area and deliver them area by area. But he won't stay in any place for too long: he will just drop the presents and leave, without waiting for the children to open them.

Maybe we could suggest that he keep a list of the cities he still has to visit, and remove each city from the list once its presents are delivered. This way, he will always know where he still has to go, and he won't waste time going back to the same place.

To do this, he could use a TTL (Time To Live) on the cities he has to visit: set the TTL to the time he needs to deliver the presents, and remove each city from the list once its TTL has expired.

This is what his journey could look like:

{ "city": "Sidney", "deliver": "ASAP", "ttl": 0 }
{ "city": "Singapore", "deliver": "1 minute", "ttl": 1 }
{ "city": "Vilnius", "deliver": "3 minutes", "ttl": 3 }
{ "city": "Paris", "deliver": "5 minutes", "ttl": 5 }
{ "city": "London", "deliver": "6 minutes", "ttl": 6 }
{ "city": "Montréal", "deliver": "7 minutes", "ttl": 7 }
{ "city": "San Francisco", "deliver": "9 minutes", "ttl": 9 }
{ "city": "North Pole", "deliver": "forever" }

ttl contains our TTL value in minutes:

  • no value means that the document will be kept forever.
  • zero means that we want to remove the document as soon as possible.
  • any positive value corresponds to the number of minutes to wait before removing the document.

The ttl ingest pipeline

To implement such a feature, you just need an ingest pipeline:

DELETE /_ingest/pipeline/ttl
PUT /_ingest/pipeline/ttl
{
  "processors": [
    {
      "set": {
        "field": "ingest_date",
        "value": "{{{_ingest.timestamp}}}"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
          ctx['ttl_date'] = ZonedDateTime.parse(ctx['ingest_date']).plusMinutes(ctx['ttl']);
        """,
        "if": "ctx?.ttl != null"
      }
    },
    {
      "remove": {
        "field": [ "ingest_date", "ttl" ],
        "ignore_missing": true
      }
    }
  ]
}

Let's explain this a bit.

The first processor sets a temporary field (ingest_date) in the document and injects into it the time at which the pipeline is executed (_ingest.timestamp), which is more or less the indexing date.

Then we run a painless script:

ctx['ttl_date'] = ZonedDateTime.parse(ctx['ingest_date']).plusMinutes(ctx['ttl']);

This script creates a ZonedDateTime Java object from the String value available in the ingest_date field. Then we call the plusMinutes method with the value of ttl as the parameter, which shifts the ingest date by that many minutes. The result is stored in a new ttl_date field.

Note that we need to add a condition to run this processor only if a ttl field exists:

"if": "ctx?.ttl != null"

Then we remove the fields we no longer need: ingest_date, and optionally ttl. Note that for debugging purposes, it might be smart to keep ttl around. Since some documents have no ttl at all, we also need to ignore missing fields; this is done with the "ignore_missing": true parameter.

To test this pipeline, we can use the simulate API:

POST /_ingest/pipeline/ttl/_simulate?filter_path=docs.doc._source,docs.doc._ingest
{
  "docs": [
    {
      "_source": { "city": "Sidney", "deliver": "ASAP", "ttl": 0 }
    },
    {
      "_source": { "city": "Singapore", "deliver": "1 minute", "ttl": 1 }
    },
    {
      "_source": { "city": "Paris", "deliver": "5 minutes", "ttl": 5 }
    },
    {
      "_source": { "city": "North Pole", "deliver": "forever" }
    }
  ]  
}

This gives:

{
  "docs": [
    {
      "doc": {
        "_source": {
          "deliver": "ASAP",
          "ttl_date": "2023-11-23T11:14:42.723353333Z",
          "city": "Sidney"
        },
        "_ingest": {
          "timestamp": "2023-11-23T11:14:42.723353333Z"
        }
      }
    },
    {
      "doc": {
        "_source": {
          "deliver": "1 minute",
          "ttl_date": "2023-11-23T11:15:42.723413177Z",
          "city": "Singapore"
        },
        "_ingest": {
          "timestamp": "2023-11-23T11:14:42.723413177Z"
        }
      }
    },
    {
      "doc": {
        "_source": {
          "deliver": "5 minutes",
          "ttl_date": "2023-11-23T11:19:42.723419835Z",
          "city": "Paris"
        },
        "_ingest": {
          "timestamp": "2023-11-23T11:14:42.723419835Z"
        }
      }
    },
    {
      "doc": {
        "_source": {
          "city": "North Pole",
          "deliver": "forever"
        },
        "_ingest": {
          "timestamp": "2023-11-23T11:14:42.723423778Z"
        }
      }
    }
  ]
}

We can see the shifted dates for document removal.

Automatically create the ttl_date field

We can use the final_pipeline index setting to define the ttl pipeline as the one to use just before the actual index operation.

DELETE /ttl-demo
PUT /ttl-demo
{
  "settings": {
    "final_pipeline": "ttl"
  },
  "mappings": {
    "_source": {
      "excludes": [
        "ttl_date"
      ]
    },
    "properties": {
      "ttl_date": {
        "type": "date"
      }
    }
  }
}

You could also use the default_pipeline index setting, but be aware that in that case the ttl pipeline won't be called if a user indexes a document with their own pipeline, like:

POST /ttl-demo/_doc?pipeline=my-pipeline
{ 
  "city": "Singapore", 
  "deliver": "1 minute", 
  "ttl": 1
}

Note that we exclude the ttl_date field from _source. We don't want to store it there, as it's just a "technical" field.

Index documents

We can now index our dataset:

POST /ttl-demo/_bulk
{ "index": {} }
{ "city": "Sidney", "deliver": "ASAP", "ttl": 0 }
{ "index": {} }
{ "city": "Singapore", "deliver": "1 minute", "ttl": 1 }
{ "index": {} }
{ "city": "Vilnius", "deliver": "3 minutes", "ttl": 3 }
{ "index": {} }
{ "city": "Paris", "deliver": "5 minutes", "ttl": 5 }
{ "index": {} }
{ "city": "London", "deliver": "6 minutes", "ttl": 6 }
{ "index": {} }
{ "city": "Montréal", "deliver": "7 minutes", "ttl": 7 }
{ "index": {} }
{ "city": "San Francisco", "deliver": "9 minutes", "ttl": 9 }
{ "index": {} }
{ "city": "North Pole", "deliver": "forever" }

Remove TTL'ed documents

It's now easy to run a Delete By Query call:

POST /ttl-demo/_delete_by_query
{
  "query": {
    "range": {
      "ttl_date": {
        "lte": "now"
      }
    }
  }
}

We just want to delete all the documents whose ttl_date is older than now, the time at which the request is executed.

If we run it immediately, we can see that only the document { "city": "Sidney", "deliver": "ASAP", "ttl": 0 } is removed.
After one minute, { "city": "Singapore", "deliver": "1 minute", "ttl": 1 } goes away as well. And after a few more minutes, only { "city": "North Pole", "deliver": "forever" } remains. It will be kept forever.
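
To check what remains after each run, a quick sketch is to list the remaining cities:

GET /ttl-demo/_search?filter_path=hits.hits._source.city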

Using Watcher to run it every minute

You can use a crontab to run such a query every minute:

* * * * * curl -XPOST -u elastic:changeme https://127.0.0.1:9200/ttl-demo/_delete_by_query -H 'Content-Type: application/json' -d '{"query":{"range":{"ttl_date":{"lte":"now"}}}}'

Note that you will have to monitor this job. But if you have a commercial license, you could also run this directly from Elasticsearch using Watcher:

PUT _watcher/watch/ttl
{
  "trigger": {
    "schedule": {
      "interval": "1m"
    }
  },
  "input": {
    "simple" : {}
  },
  "condition": {
    "always" : {}
  },
  "actions": {
    "call_dbq": {
      "webhook": {
        "url": "https://127.0.0.1:9200/ttl-demo/_delete_by_query",
        "method": "post",
        "body": "{\"query\":{\"range\":{\"ttl_date\":{\"lte\":\"now\"}}}}",
        "auth": {
          "basic": {
            "username": "elastic",
            "password": "changeme"
          }
        }
      }
    }
  }
}

Note that we use the interval parameter to trigger this watch every minute, and a webhook action to call the Delete By Query API. Note also that we need to provide authentication information.
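
To verify that the watch is registered (and inspect its status), you can use the Get Watch API, for example:

GET _watcher/watch/ttl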

If you are not running on cloud.elastic.co but locally with a self-signed certificate (Elasticsearch is secured by default), you need to set xpack.http.ssl.verification_mode to none, otherwise Elasticsearch will not accept the self-signed certificate. Of course, this is just for testing purposes. Don't do that in production!
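
For a local test cluster, that setting would typically go into elasticsearch.yml, something like this sketch:

# elasticsearch.yml - accept self-signed certificates for Watcher's webhook calls (testing only!)
xpack.http.ssl.verification_mode: none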

After at most one minute, Santa will see that the cities he has already visited have been removed from the list:

GET /ttl-demo/_search
{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "ttl-demo",
        "_id": "IBTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "deliver": "3 minutes",
          "city": "Vilnius"
        }
      },
      {
        "_index": "ttl-demo",
        "_id": "IRTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "deliver": "5 minutes",
          "city": "Paris"
        }
      },
      {
        "_index": "ttl-demo",
        "_id": "IhTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "deliver": "6 minutes",
          "city": "London"
        }
      },
      {
        "_index": "ttl-demo",
        "_id": "IxTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "deliver": "7 minutes",
          "city": "Montréal"
        }
      },
      {
        "_index": "ttl-demo",
        "_id": "JBTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "deliver": "9 minutes",
          "city": "San Francisco"
        }
      },
      {
        "_index": "ttl-demo",
        "_id": "JRTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "city": "North Pole",
          "deliver": "forever"
        }
      }
    ]
  }
}

And at the end, only his home remains:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "ttl-demo",
        "_id": "JRTn-4sBOKvQy-0aU35M",
        "_score": 1,
        "_source": {
          "city": "North Pole",
          "deliver": "forever"
        }
      }
    ]
  }
}

This is an easy, quick-and-dirty solution to remove old data from your Elasticsearch cluster. But note that you should never apply this technique to logs or any time-based indices, or when the quantity of data to be removed that way is more than, let's say, 10% of the dataset.

Instead, you should prefer the Delete Index API to drop a full index at once rather than removing a large set of documents. That's way more efficient.
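
For example, with the daily index naming scheme shown in the next section, expiring one day of data becomes a single, very cheap call:

DELETE /ttl-demo-2023-11-27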

Drop the index (preferred)

To do this, we can actually change the pipeline to send the data to an index whose name contains the TTL date:

PUT /_ingest/pipeline/ttl
{
  "processors": [
    {
      "set": {
        "field": "ingest_date",
        "value": "{{{_ingest.timestamp}}}"
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": """
          ctx['ttl_date'] = ZonedDateTime.parse(ctx['ingest_date']).plusDays(ctx['ttl']);
        """,
        "ignore_failure": true
      }
    },
    {
      "date_index_name" : {
        "field" : "ttl_date",
        "index_name_prefix" : "ttl-demo-",
        "date_rounding" : "d",
        "date_formats": ["yyyy-MM-dd'T'HH:mm:ss.nz"],
        "index_name_format": "yyyy-MM-dd",
        "if": "ctx?.ttl_date != null"
      }
    },
    {
      "set": {
        "field" : "_index",
        "value": "ttl-demo-forever",
        "if": "ctx?.ttl_date == null"
      }
    },
    {
      "remove": {
        "field": [ "ingest_date", "ttl", "ttl_date" ],
        "ignore_missing": true
      }
    }
  ]
}

In this example, I switched to daily indices because it's much closer to what you would see in production: you normally don't expire data after a few minutes, but rather after days or months.

We changed the script to add days instead:

ctx['ttl_date'] = ZonedDateTime.parse(ctx['ingest_date']).plusDays(ctx['ttl']);

If the ttl_date field exists, we use the date_index_name processor to build the index name from it. We use the date_rounding parameter to round the date to the day and the index_name_format parameter to format it as yyyy-MM-dd. This will produce index names like ttl-demo-2023-11-27:

{
  "date_index_name" : {
    "field" : "ttl_date",
    "index_name_prefix" : "ttl-demo-",
    "date_rounding" : "d",
    "date_formats": ["yyyy-MM-dd'T'HH:mm:ss.nz"],
    "index_name_format": "yyyy-MM-dd",
    "if": "ctx?.ttl_date != null"
  }
}

If the ttl_date field does not exist, we just set the index name to ttl-demo-forever:

{
  "set": {
    "field" : "_index",
    "value": "ttl-demo-forever",
    "if": "ctx?.ttl_date == null"
  }
}

We can reindex our dataset:

POST /ttl-demo/_bulk?pipeline=ttl
{ "index": {} }
{ "city": "Sidney", "deliver": "ASAP", "ttl": 0 }
{ "index": {} }
{ "city": "Singapore", "deliver": "1 day", "ttl": 1 }
{ "index": {} }
{ "city": "Vilnius", "deliver": "3 days", "ttl": 3 }
{ "index": {} }
{ "city": "Paris", "deliver": "5 days", "ttl": 5 }
{ "index": {} }
{ "city": "North Pole", "deliver": "forever" }

And we can see that the documents are now in different indices:

GET /ttl-demo-*/_search?filter_path=hits.hits._index,hits.hits._source.deliver

Gives:

{
  "hits": {
    "hits": [
      {
        "_index": "ttl-demo-2023-11-27",
        "_source": {
          "deliver": "ASAP"
        }
      },
      {
        "_index": "ttl-demo-2023-11-28",
        "_source": {
          "deliver": "1 day"
        }
      },
      {
        "_index": "ttl-demo-2023-11-30",
        "_source": {
          "deliver": "3 days"
        }
      },
      {
        "_index": "ttl-demo-2023-12-02",
        "_source": {
          "deliver": "5 days"
        }
      },
      {
        "_index": "ttl-demo-forever",
        "_source": {
          "deliver": "forever"
        }
      }
    ]
  }
}

The index name no longer refers, as we are used to, to the date of the data, but to the date when the data should be removed. So we can again run a crontab to remove the old indices every day. The following line is intended to be run on a macOS system:

0 0 * * * curl -XDELETE -u elastic:changeme https://127.0.0.1:9200/ttl-demo-$(date -v -1d -j +%F)
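
On Linux, with GNU date, the equivalent job could look like this sketch (only the date command changes):

0 0 * * * curl -XDELETE -u elastic:changeme https://127.0.0.1:9200/ttl-demo-$(date -d "yesterday" +%F)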

Wrapping up

We saw two ways of implementing a TTL on Elasticsearch documents. The first one is to use a TTL field and remove the documents with a Delete By Query call. The second one (much more efficient when a lot of data has to be removed) is to use the TTL field to route the documents to different indices and then remove whole indices with a crontab.

But with both solutions, the documents are still visible until the crontab (or the watch) runs.

You could think of using a filtered index alias to hide the expired documents:

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "ttl-demo",
        "alias": "ttl-filtered",
        "filter": {
          "bool": {
            "filter": [
              {
                "range": {
                  "ttl_date": {
                    "gt": "now/m"
                  }
                }
              }
            ]
          }
        }
      }
    }
  ]
}

Searching the ttl-filtered alias will only return documents that have not expired yet, even if the expired documents have not been removed yet by the batch process (crontab or Watcher).
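
Searching then simply targets the alias instead of the index, for example:

GET /ttl-filtered/_search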

Santa now knows exactly where to go next, and can then enjoy a well-deserved year of rest!
