Store last X documents per id

Hello Elastic team and community!

I am looking for a way to keep only the most recent documents for a particular id. Ideally this happens automatically via some Elasticsearch configuration, or with a single follow-up request after a new document is added (and let's say documents are only added, never updated).

You can imagine the mapping at least like this:

{
  "properties": {
    "id": {
      "type": "keyword"
    },
    "timestamp": {
      "type": "long"
    }
  }
}

Other fields are omitted for simplicity.
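
For illustration, a typical document under this mapping could look like this (hypothetical index name and values):

POST /actions/_doc
{
  "id": "user-42",
  "timestamp": 1712102400000
}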

I checked whether this problem had been solved before, and I found:

  1. Is there a way to implement a keep last N docs per id in a specific index?

    But frankly I don't get the approach, and I wasn't able to make it work.

  2. Using FROM option with delete by query

    But it seems “from” is not supported in delete by query since Elasticsearch v8, and the author's solution requires at least two requests: one to find the documents to delete, and another one to delete them.

Maybe there are other, better options to consider.

You are welcome to share your ideas.

Thanks in advance!

Regards,

Vasiliy

What about using the id as the document _id and then just update the document with its latest version?
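
For example (a sketch with hypothetical names and values):

PUT /index/_doc/user-42
{
  "id": "user-42",
  "timestamp": 1712102400000
}

Indexing with the same _id again would simply replace the previous version.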

Hey David! Thank you for your input. Unfortunately, the document id is different. To give more context: every document represents a user action, and every action has its own id, which is used as the document _id. Every action is made by a user, and every user has a unique id. You can treat the id specified in the mapping as the user id. With this in mind, it is not about updating a document to its latest state, but more like keeping the last X actions per user in Elasticsearch. If you have further questions or ideas, please be my guest!

I don't think I can come up with any smart idea.

Apart from storing only the list of events for a given user in a single document:

GET /events/_doc/abc123
{
  "name": "Joe Smith",
  "events": [{
    "@timestamp": "2024-04-02",
    "type": "add"
  },{
    "@timestamp": "2024-04-03",
    "type": "update"
  },{
    "@timestamp": "2024-04-04",
    "type": "remove"
  }
  ]
}

And then, with some ingest pipeline, keep only the last 3 events when a POST like the following happens:

POST /events/_doc/?pipeline=events
{
  "id": "abc123",
  "name": "Joe Smith",
  "event": {
    "@timestamp": "2024-04-05",
    "type": "something"
  }
}

That would index behind the scenes:

PUT /events/_doc/abc123
{
  "name": "Joe Smith",
  "events": [{
    "@timestamp": "2024-04-03",
    "type": "update"
  },{
    "@timestamp": "2024-04-04",
    "type": "remove"
  },{
    "@timestamp": "2024-04-05",
    "type": "something"
  }
  ]
}
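
Note that an ingest pipeline only sees the incoming document, not the one already stored, so one way to actually get this behavior is a scripted upsert through the _update API instead (a sketch with hypothetical names, not a tested implementation; the script appends the new event and trims the array to the last 3):

POST /events/_update/abc123
{
  "scripted_upsert": true,
  "upsert": {
    "name": "Joe Smith",
    "events": []
  },
  "script": {
    "source": "if (ctx._source.events == null) { ctx._source.events = []; } ctx._source.events.add(params.event); while (ctx._source.events.size() > params.max) { ctx._source.events.remove(0); }",
    "params": {
      "event": { "@timestamp": "2024-04-05", "type": "something" },
      "max": 3
    }
  }
}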

Would that work for you?

Be careful: updating documents requires a lot of I/O... So it might not be the ideal thing to do. What is the business case? Why do you want to keep only the last X events? Is that for a technical reason or a business reason?

Hello and thank you again!

Maybe =) but I am not sure I fully understood the pipeline example. If it is a single document per user, will it be possible to amend the events without loading the whole document from Elasticsearch? Does the "POST /events/_doc/?pipeline=events" call actually append to the list for document abc123? What I miss is how it guarantees the last-3 restriction. Can you expand on the idea a bit more, please?

The business case is to have fast storage for recent user actions, for further on-demand, per-user processing in real time. We have long-term storage, but that holds tens of thousands of actions per user, terabytes of data, covering years. As you can imagine, it is not so fast to query, and it also holds data that is not relevant to this case. The goal is to store only the latest user activities and be able to get them per user in milliseconds.

Are you sure about this?

That looks like a good fit for time-based indices, where you can move older indices, let's say after one month or so, onto colder nodes...

Have a look at ILM: Manage the index lifecycle | Elasticsearch Guide [8.13] | Elastic
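
A minimal sketch of such a policy (assuming nodes tagged with a "data: warm" attribute; the names are hypothetical):

PUT _ilm/policy/events-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "30d" }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": { "data": "warm" }
          }
        }
      }
    }
  }
}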

As the "business case" is actually a "technical case", I'd not try to solve a problem that does not really exist... I mean that trying to be fast is not something to solve on the business side but on the technical side.

Dear David,

Yes, we did measure that, and depending on the dataset it can take from seconds to minutes to get the necessary data from long-term storage.

Sure, you can say it is a technical case, even though from the product perspective it is required to stay real-time in the current data flow processing: based on a current user action plus the last X actions, the system needs to evaluate things and provide particular services based on the outcome.

Yes, thank you. I've already looked at ILM. It can help with keeping documents for a recent time range, e.g. the last 30 days or the last half year. But I don't see how it can solve the last-X-actions-per-user task.

What do you mean? What hardware is involved in your cluster?

I think it is irrelevant to the original topic; just note it is not Elasticsearch. The data is stored separately, in a different way, but it is not as fast to query, and it contains a lot of extra content that is not relevant to the task. That's why we are considering Elasticsearch as dedicated storage for fast lookups, holding only relevant things. Maybe your point is to not nail it down to the last 128 documents per user but keep more, which can still be fast in terms of search. We just don't have a need to keep more, and that could require more disk space / more hardware / a bigger cluster for no reason.

So there's no automatic way in Elasticsearch to do this.
You will probably have to fetch the latest data for a given user, get the _id of the oldest document(s), and then delete them by id.

So:

POST index/_doc
GET /index/_search
DELETE /index/_doc/xyz
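
Spelled out a bit more (a sketch with hypothetical index, field, and id values, keeping the last 3 documents per user):

# 1. Index the new action
POST /index/_doc
{
  "id": "user-42",
  "timestamp": 1712102400000
}

# 2. Find everything beyond the newest 3 for that user
GET /index/_search
{
  "query": { "term": { "id": "user-42" } },
  "sort": [ { "timestamp": "desc" } ],
  "from": 3,
  "size": 100,
  "_source": false
}

# 3. Delete each surplus hit by its _id
DELETE /index/_doc/xyz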

I see, thank you. I appreciate your help.