I am looking for a way to keep only the most recent documents for a particular id. Ideally this happens automatically via some Elasticsearch configuration, or with a one-hop request after a new document is added (and let's say documents are only added, never updated).
I found this topic, but frankly I don't get the approach and I wasn't able to make it work:
Using FROM option with delete by query
But it seems like "from" is not supported in delete by query since Elasticsearch v8, and the author's solution requires at least two requests: one to find the documents to delete, and another one to delete them.
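For reference, this is as far as I can get with a plain delete by query (the index and field names here are just illustrative): without something like "from" there is no way to tell it to skip the newest X documents, so it would wipe all of the user's actions.

POST /user-actions/_delete_by_query
{
  "query": {
    "term": { "user_id": "user-42" }
  }
}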
Maybe there are other, better options to consider.
Hey David! Thank you for your input. Unfortunately, the document id is different. To give more context: every document represents some user action, and every action has its own id, which is used as the document _id. Every action is made by a user, and every user has a unique id. You can treat the id specified in the mapping as the user id. With this in mind, you can imagine it is not about updating a document to the latest state, but more like keeping the last X actions per user in Elasticsearch (roughly the mapping sketched below). If you have further questions and ideas, please be my guest!
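The mapping is roughly this (field names are just for illustration; the action id goes into _id):

PUT /user-actions
{
  "mappings": {
    "properties": {
      "user_id":    { "type": "keyword" },
      "action":     { "type": "keyword" },
      "@timestamp": { "type": "date" }
    }
  }
}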
Be careful: updating documents requires a lot of I/O... So it might not be the ideal thing to do. What is the business case? Why do you want to keep only the last X events? Is that for a technical reason or a business reason?
Maybe =) but I am not sure I fully understood the pipeline example. If it is a single document per user, will it be possible to amend the events without loading the whole document from Elasticsearch? Does the "POST /events/_doc/?pipeline=events" call actually enhance the list for document abc123? What I miss is how it guarantees the last-3 restriction (the only single-request way I can imagine is sketched below). Can you expand the idea a bit more, please?
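For comparison, the only single-request way I know to append to a per-user list and trim it is an update with a script, something like this (the field and parameter names are made up), and as far as I understand it still rewrites the whole document on every action:

POST /events/_update/abc123
{
  "scripted_upsert": true,
  "script": {
    "lang": "painless",
    "source": """
      // append the incoming event, then keep only the newest `max` entries
      if (ctx._source.events == null) { ctx._source.events = []; }
      ctx._source.events.add(params.event);
      while (ctx._source.events.size() > params.max) {
        ctx._source.events.remove(0);
      }
    """,
    "params": {
      "event": { "action_id": "a-789", "@timestamp": "2024-05-01T12:00:00Z" },
      "max": 3
    }
  },
  "upsert": { "events": [] }
}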
The business case is to have fast storage for recent user actions, for further processing in a real-time manner, per user, on demand. We have long-term storage, but that holds tens of thousands of actions per user, terabytes of data, covering years. As you can imagine, it is not that fast to query, and it also holds data that is not relevant to the case. The goal is to store only the last user activities and be able to get them per user in milliseconds.
As the "business case" is actually a "technical case", I'd not try to solve a problem that does not really exist... I mean that trying to be fast is not something to solve on the business side but on the technical side.
Yes, we did measure that, and depending on the dataset it can take from seconds to minutes to get the necessary data from the long-term storage.
Sure, you can say it is a technical case, even though from the product perspective it is required to stay real time in the current data flow processing: based on the current user action plus the last X actions, the system needs to evaluate things and provide particular services based on the outcome.
Yes, thank you. I've already looked at ILM. It can help with keeping documents for a recent time range, e.g. the last 30 days or the last half year. But I don't see how it can solve the last-X-actions-per-user task.
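To illustrate, an age-based policy like this (the policy name and settings are made up) is the best I could come up with, and it only expires data by time, not by document count per user:

PUT _ilm/policy/recent-user-actions
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "7d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}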
I think it is irrelevant to the original topic; just note it is not Elasticsearch. The data is stored separately, in a different way, but it is not as fast to query and it contains a lot of extra data that is not relevant for the task. That's why we consider Elasticsearch as a dedicated storage for fast lookups that only stores the relevant things. Maybe your point is to not nail it down to the last 128 documents per user and keep more, which could still be fast in terms of search. We just don't have a need to keep more, and that would require more disk space / more hardware / a bigger cluster for no reason.
So there's no automatic way in Elasticsearch to do this.
You will probably have to fetch the latest data for a given user, then get the _id of the oldest document and then delete it by id.
So:
POST index/_doc
GET /index/_search
DELETE /index/_doc/xyz
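A rough sketch of those three steps (the index name, field names and the last-128 limit are only illustrative; the cleanup has to be done by the client after each indexing call):

# 1. Index the new action
POST /user-actions/_doc
{
  "user_id": "user-42",
  "action_id": "abc123",
  "@timestamp": "2024-05-01T12:00:00Z"
}

# 2. Find this user's documents beyond the newest 128
#    ("from" is supported in _search, unlike in _delete_by_query)
GET /user-actions/_search
{
  "query": { "term": { "user_id": "user-42" } },
  "sort": [ { "@timestamp": "desc" } ],
  "from": 128,
  "size": 10,
  "_source": false
}

# 3. Delete each hit returned above by its _id
DELETE /user-actions/_doc/xyz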