Implement TTL for "dynamic" documents

Current situation

I have an Elasticsearch workflow equivalent to the following:
I have tons of "animal sighting stations", each occasionally sending a "sighting event" of a specific animal species to the backend.
The backend stores only one document per (station, animal) pair, upserting it into Elasticsearch in the following format:

stationId: string
animalSpecies: string
lastSeenAt: Date
suspectedRabies: boolean
<more redacted fields>
  • The lastSeenAt timestamp is updated on each upsert.
  • The suspectedRabies boolean is set to true when receiving the first sighting event with suspectedRabies: true, and it stays true regardless of future upserts (see the sketch after this list).
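
Conceptually, each event ends up as a scripted upsert, roughly like this (a simplified sketch using the official Python client with 8.x-style keyword arguments; the index name, the _id scheme and the connection details are placeholders):

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder connection
INDEX = "sightings"                           # placeholder index name

def upsert_sighting(station_id: str, species: str, suspected_rabies: bool) -> None:
    """One document per (station, animal): refresh lastSeenAt on every event,
    and let suspectedRabies latch to true once any event reported it."""
    now = datetime.now(timezone.utc).isoformat()
    es.update(
        index=INDEX,
        id=f"{station_id}:{species}",   # deterministic _id per (station, animal) pair
        script={
            "lang": "painless",
            "source": (
                "ctx._source.lastSeenAt = params.now; "
                "ctx._source.suspectedRabies = ctx._source.suspectedRabies || params.rabies;"
            ),
            "params": {"now": now, "rabies": suspected_rabies},
        },
        upsert={
            "stationId": station_id,
            "animalSpecies": species,
            "lastSeenAt": now,
            "suspectedRabies": suspected_rabies,
        },
    )
```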

The problem

There's a lot of space wasted on documents that are no longer relevant, so I want to add a retention policy.
For example, if at least 10 days have passed since lastSeenAt, delete the document.

Solutions I've considered

1. ILM: Inserting a new document each time and taking the latest timestamp

This won't work, for a couple of reasons:

  • This index receives all sorts of elaborate aggregations, which would become virtually impossible to implement once there are multiple copies of each sighting (see the sketch after this list).
  • The suspectedRabies property has to be returned as true if any of the copies has it as true. (This can be solved by adding a _search.)
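
To illustrate the first point: with one document per event, even just reconstructing the current per-pair state would need something like the composite aggregation below (a rough sketch that assumes a sightings-* index pattern and keyword-mapped fields), and the actual analytics would then have to be layered on top of that:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder connection

# "Latest state per (station, animal)" if every sighting were its own document.
resp = es.search(
    index="sightings-*",   # assumed time-based indices under ILM
    size=0,
    aggs={
        "per_pair": {
            "composite": {
                "size": 1000,   # would need after_key pagination in real use
                "sources": [
                    {"station": {"terms": {"field": "stationId"}}},
                    {"species": {"terms": {"field": "animalSpecies"}}},
                ],
            },
            "aggs": {
                "last_seen": {"max": {"field": "lastSeenAt"}},
                # suspectedRabies must come out true if ANY copy had it true
                "rabies_docs": {"filter": {"term": {"suspectedRabies": True}}},
            },
        }
    },
)

for bucket in resp["aggregations"]["per_pair"]["buckets"]:
    state = {
        "stationId": bucket["key"]["station"],
        "animalSpecies": bucket["key"]["species"],
        "lastSeenAt": bucket["last_seen"]["value_as_string"],
        "suspectedRabies": bucket["rabies_docs"]["doc_count"] > 0,
    }
```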

2. ILM: Deleting and creating the document

The problem with naively using ILM is that updating a document in place doesn't move it into the current write index, so "active sightings" sitting in an aging index would eventually be deleted along with it (but we want to keep them).

So, this solution involves not updating the document, but instead doing the following:

  1. Check whether the sighting's document already exists in the index alias.
  2. If it does, delete it and then insert it again (making sure to take the "maximum" of the old and new suspectedRabies values).
  3. If it doesn't, just insert it.

What worries me about this solution are race conditions/version conflicts, but perhaps these can be handled with enough optimistic concurrency control and/or optimism (see the sketch below).

An example case I want to avoid:
A sighting without suspectedRabies arrives. The writer handling it sees no existing document (because another sighting just arrived and deleted it), so it inserts a fresh one.
The suspectedRabies: true value, which might have been set before the deletion, is now lost.
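
For reference, the flow I have in mind, with optimistic concurrency control layered in, is sketched below (Python client with 8.x-style keyword arguments; the alias names, the _id scheme and the connection are placeholders, and I'm not claiming this closes every race):

```python
from elasticsearch import Elasticsearch, ConflictError

es = Elasticsearch("http://localhost:9200")   # placeholder connection
SEARCH_ALIAS = "sightings"        # placeholder alias over all backing indices
WRITE_ALIAS = "sightings-write"   # placeholder ILM write alias

def record_sighting(station_id: str, species: str, seen_at: str,
                    suspected_rabies: bool, max_retries: int = 5) -> None:
    doc_id = f"{station_id}:{species}"
    for _ in range(max_retries):
        # 1. Look the pair up via _search (a GET by _id needs a single
        #    concrete index, which a rolled-over alias doesn't give us).
        hits = es.search(
            index=SEARCH_ALIAS,
            size=1,
            seq_no_primary_term=True,
            query={"ids": {"values": [doc_id]}},
        )["hits"]["hits"]
        current = hits[0] if hits else None

        # Carry the "maximum" of the old and new suspectedRabies values.
        rabies = suspected_rabies
        if current is not None:
            rabies = rabies or current["_source"].get("suspectedRabies", False)

        try:
            if current is not None:
                # 2. Delete only if nobody touched the document since we read it.
                es.delete(
                    index=current["_index"],
                    id=doc_id,
                    if_seq_no=current["_seq_no"],
                    if_primary_term=current["_primary_term"],
                )
            # 3. Re-create in the current write index; op_type=create fails
            #    with a conflict if a concurrent writer beat us to it.
            es.index(
                index=WRITE_ALIAS,
                id=doc_id,
                op_type="create",
                document={
                    "stationId": station_id,
                    "animalSpecies": species,
                    "lastSeenAt": seen_at,
                    "suspectedRabies": rabies,
                },
            )
            return
        except ConflictError:
            continue   # lost a race; re-read and retry
    raise RuntimeError(f"gave up after {max_retries} conflicting writes for {doc_id}")
```

The op_type=create on the re-insert plus the retry loop is what is supposed to catch the scenario above, but I'm not fully convinced it covers every interleaving.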

3. Using a periodic "Delete by query"

Apparently, this solution is similar to how Elasticsearch's historic _ttl feature was implemented.
As that feature was removed due to performance issues, I'd rather avoid it.

Questions

Does one of the solutions above seem viable (at scale)?
Are there better solutions I'm missing?
Am I using Elasticsearch all wrong to begin with?

Any feedback would be much appreciated.

ILM requires time-based indices, which makes it difficult to update/upsert the way you describe.

This is indeed what the old TTL implementation used behind the scenes. Delete-by-query still exists and, as far as I know, is the only solution that gives the flexibility you require. You will, however, have to trigger delete-by-query from outside Elasticsearch. If you run it reasonably frequently and only delete a small portion of the data each time, I think it would still work well, even if it is less efficient than deleting complete indices.
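
As a rough sketch of what such an externally triggered job could look like (Python client with 8.x-style keyword arguments; the index name, TTL and throttle value are placeholders to tune):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # placeholder connection

def purge_stale_sightings(index: str = "sightings", ttl: str = "10d") -> None:
    """Delete documents whose lastSeenAt is older than the TTL.
    Run this from an external scheduler (cron or similar), reasonably often."""
    resp = es.delete_by_query(
        index=index,
        query={"range": {"lastSeenAt": {"lt": f"now-{ttl}"}}},
        conflicts="proceed",         # skip documents updated while the job runs
        requests_per_second=500,     # throttle so each run stays a small, cheap batch
        wait_for_completion=True,
    )
    print(f"deleted {resp['deleted']} stale documents")
```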

Thank you for your response!

My concern about the 3rd solution is that I cannot guarantee that only "a small portion of the data" would be deleted on each run.

What do you think about the 2nd solution? I haven't found any significant disadvantage to it (assuming the mentioned race conditions can be handled), except that the code implementing it would be rather unpleasant.

What is the total expected size of the data set?

It will depend on the chosen TTL, but it may be around 100GB-200GB (not counting replicas).

That is not very large, so I would recommend trying delete-by-query rather than adding all the extra complexity of option 2.