Current situation
I have an Elasticsearch workflow equivalent to the following:
I have tons of "animal sighting stations", each occasionally sending a "sighting event" of a specific animal species to the backend.
The backend stores only one document per (station, animal) pair. It upserts it into Elasticsearch in the following format:
stationId: string
animalSpecies: string
lastSeenAt: Date
suspectedRabies: boolean
<more redacted fields>
- The lastSeenAt timestamp is updated on each upsert.
- The suspectedRabies boolean is set to true when receiving the first sighting event with suspectedRabies: true (and it stays true regardless of future upserts).
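To make the upsert semantics above concrete, here is a minimal sketch of what the update request body could look like, assuming the Update API with a Painless script. The ID scheme, the function names and the ISO-8601 string timestamps are illustrative assumptions, not the actual backend code. `merge_sighting` is a pure-Python mirror of the script, so the rules can be checked without a cluster.

```python
def sighting_doc_id(station_id: str, animal_species: str) -> str:
    # Hypothetical ID scheme: one document per (station, animal) pair.
    return f"{station_id}:{animal_species}"


def build_upsert_body(event: dict) -> dict:
    """Build a body for the Update API (POST /<index>/_update/<id>)."""
    return {
        "script": {
            "lang": "painless",
            "source": (
                "ctx._source.lastSeenAt = params.lastSeenAt; "
                "ctx._source.suspectedRabies = "
                "ctx._source.suspectedRabies == true || params.suspectedRabies;"
            ),
            "params": {
                "lastSeenAt": event["lastSeenAt"],
                "suspectedRabies": event["suspectedRabies"],
            },
        },
        # Indexed as-is when the pair has not been seen before.
        "upsert": dict(event),
    }


def merge_sighting(existing, event):
    """Pure-Python mirror of the script above (assumed semantics)."""
    if existing is None:
        return dict(event)
    merged = dict(existing)
    # lastSeenAt always moves to the newest event.
    merged["lastSeenAt"] = event["lastSeenAt"]
    # suspectedRabies is sticky: once true, it stays true.
    merged["suspectedRabies"] = existing["suspectedRabies"] or event["suspectedRabies"]
    return merged
```

The redacted fields are left untouched by the script in this sketch; how they should be merged depends on details not shown here.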
The problem
A lot of space is wasted on documents that are no longer relevant, so I want to add a retention policy.
For example, if at least 10 days have passed since lastSeenAt, delete the document.
Solutions I've considered
1. ILM: Inserting a new document each time and taking the latest timestamp
Won't work for a couple of reasons:
- This index receives all sorts of elaborate aggregations, which will be virtually impossible to implement once there are multiple copies of each sighting.
- The suspectedRabies property has to be returned as true if one of the copies has it as true. (This can be solved by adding a _search.)
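To show what every query would have to reproduce under this approach, here is a rough sketch of collapsing the copies back into one record per (station, animal) pair. It is plain Python over already-fetched documents, not an Elasticsearch aggregation, and the function name is made up for illustration.

```python
def collapse_sightings(copies):
    """Collapse append-only copies into one record per (station, animal)."""
    collapsed = {}
    for doc in copies:
        key = (doc["stationId"], doc["animalSpecies"])
        current = collapsed.get(key)
        if current is None:
            # Other fields are taken from the first copy seen (an assumption).
            collapsed[key] = dict(doc)
            continue
        # Assumes ISO-8601 UTC strings in one format, which sort chronologically.
        current["lastSeenAt"] = max(current["lastSeenAt"], doc["lastSeenAt"])
        # True if any copy has it as true.
        current["suspectedRabies"] = current["suspectedRabies"] or doc["suspectedRabies"]
    return collapsed
```

Every existing aggregation would need an equivalent of this step before it could run, which is the difficulty described in the first bullet.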
2. ILM: Deleting and creating the document
The problem with naively using ILM is that updating a document doesn't advance it to another tier, which means that "active sightings" may get deleted (but we want to keep them).
So, this solution involves not updating the document, but instead doing the following:
- Check if the sighting is in the index alias.
- If it is, delete it, and then insert it (making sure to choose the "maximum" value of suspectedRabies).
- If it's not, insert it.
What worries me about this solution are race conditions/version conflicts, but perhaps these can be solved with enough optimistic concurrency control and/or optimism.
An example case I want to avoid:
A sighting without suspectedRabies arrives. It thinks it's not in the index (because another sighting just arrived & deleted it), and then it is added to the index.
Potentially, the suspectedRabies: true property, which might have been there before the deletion, is now lost.
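For reference, this is roughly the conflict check I have in mind, sketched against a toy in-memory store rather than a real client. In Elasticsearch the equivalent conditions would be the if_seq_no and if_primary_term parameters on index requests (or op_type=create for "only if absent"); `ToyStore`, `VersionConflict` and `record_sighting` are made-up names. The sketch does a single conditional overwrite instead of a delete followed by an insert, so it only illustrates the check-and-retry part, not the move between tiers.

```python
class VersionConflict(Exception):
    pass


class ToyStore:
    """In-memory stand-in for one index; not an Elasticsearch client."""

    def __init__(self):
        self._docs = {}  # doc_id -> (seq_no, source)
        self._next_seq = 0

    def get(self, doc_id):
        # Returns (seq_no, source), or (None, None) when the document is absent.
        return self._docs.get(doc_id, (None, None))

    def put(self, doc_id, source, expected_seq):
        # Write only if the document is still in the state the caller read.
        current_seq, _ = self.get(doc_id)
        if current_seq != expected_seq:
            raise VersionConflict(doc_id)
        self._docs[doc_id] = (self._next_seq, dict(source))
        self._next_seq += 1


def record_sighting(store, doc_id, event, max_retries=5):
    for _ in range(max_retries):
        seq, existing = store.get(doc_id)
        merged = dict(event)
        if existing is not None:
            # Keep the flag if either the stored copy or the new event has it.
            merged["suspectedRabies"] = (
                existing["suspectedRabies"] or event["suspectedRabies"]
            )
        try:
            store.put(doc_id, merged, expected_seq=seq)
            return merged
        except VersionConflict:
            continue  # another writer got in between; re-read and merge again
    raise RuntimeError("too many conflicting writers")
```

A writer holding a stale sequence number is rejected and has to re-read, so it cannot silently overwrite a document it never saw.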
3. Using a periodic "Delete by query"
Apparently, this solution is similar to the implementation of Elasticsearch's historic _ttl feature.
As this feature was removed due to performance issues, I'd rather avoid it.
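For completeness, the periodic job would send something like the following body to the Delete by query API. This is a sketch assuming lastSeenAt is mapped as a date field. The function names and the `RETENTION_DAYS` constant are illustrative, and `is_expired` is just a local mirror of the range condition.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 10  # the example retention period from above


def build_retention_query(retention_days: int = RETENTION_DAYS) -> dict:
    """Body for POST /<index>/_delete_by_query, using date math on lastSeenAt."""
    return {
        "query": {
            "range": {"lastSeenAt": {"lte": f"now-{retention_days}d"}}
        }
    }


def is_expired(last_seen_at: datetime, now: datetime,
               retention_days: int = RETENTION_DAYS) -> bool:
    # Same rule in plain Python: at least `retention_days` have passed.
    return now - last_seen_at >= timedelta(days=retention_days)
```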
Questions
Does one of the solutions above seem viable (at scale)?
Are there better solutions I'm missing?
Am I using Elasticsearch all wrong to begin with?
Any feedback would be much appreciated.