I want to control the number of documents in our index, so I'm planning on running a delete by query periodically. Our documents look something like this:
```json
{
  "client": "a51b8afa8710ccbad1",
  "createdAt": "2021-01-19T13:45:21.000Z",
  "otherfields": "a lot of data"
}
```
For our use case only the most recent 10k documents per client are relevant. Given our large number of clients, we want to make sure we're keeping the index size in check by deleting those past that limit.
I haven't been able to come up with a query that would give me the Nth document after aggregating by client and sorting by date. I don't necessarily need to do this in a single query, but what would be the recommended approach for this?
Why base the rule on the most recent X documents? Could you do it based on a date or index size?
You could use delete_by_query with a range query to delete documents older than a certain date, for example.
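A minimal sketch of that, assuming the index is called `my-index` and you want to drop everything older than 30 days:

```json
POST my-index/_delete_by_query
{
  "query": {
    "range": {
      "createdAt": {
        "lt": "now-30d/d"
      }
    }
  }
}
```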
Managing index size is what ILM is for, and it could probably do what you need automatically: roll over based on document count, then delete. You'll need to be using index templates for this, though.
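A rough sketch of such a policy (the policy name and the 30-day retention are my assumptions; note that `max_docs` counts documents per index, not per client):

```json
PUT _ilm/policy/keep-recent-docs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_docs": 10000 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```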
> Why base the rule on the most recent X documents? Could you do it based on a date or index size?
We're building a recommendation system that can only consider 10k documents at most; beyond that, too much noise is introduced and the quality starts to drop.
I failed to mention we're already cleaning up documents older than one month using a range query, as you suggest. However, some clients index many more than 10k documents in a given month, so we still need to control for that to make sure we're not considering too much data.
Thanks for the Rollover pointer, that's interesting. However, given that we have a single index containing documents from different organizations, is there something we can do to limit the number of documents per client in the index?
After some consideration, the interim solution I'll be implementing will be based on a date histogram. I can walk the daily bucket counts for each client from the most recent interval backwards, locate the day at which the running total goes past the limit, and then delete everything for that client older than that date.
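Something like this, assuming `client` is a keyword field and the index is `my-index` (the terms `size` of 1000 is an assumption; with more clients than that you'd want to page through them, e.g. with a composite aggregation):

```json
// Daily counts per client; walk each client's buckets from newest
// to oldest on the client side until the running total passes 10k.
GET my-index/_search
{
  "size": 0,
  "aggs": {
    "per_client": {
      "terms": { "field": "client", "size": 1000 },
      "aggs": {
        "per_day": {
          "date_histogram": {
            "field": "createdAt",
            "calendar_interval": "day"
          }
        }
      }
    }
  }
}

// Then, for each client over the limit, delete everything older
// than the cutoff day found above (the values here are examples).
POST my-index/_delete_by_query
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "client": "a51b8afa8710ccbad1" } },
        { "range": { "createdAt": { "lt": "2021-01-19" } } }
      ]
    }
  }
}
```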
This still has some limitations (we won't be keeping exactly 10k, and clients with big daily volume will still blow past the limit), but it works for now. We can always increase the interval granularity for specific clients if needed.
Still open to suggestions for a more solid approach though!