Elasticsearch: efficiently cleaning up indices to save space

Elasticsearch version 5.6.*.

I'm looking for a way to implement a mechanism by which one of my indices (which grows quickly, roughly 1 million documents per day) manages its storage constraints automatically.

For example: I would define the max number of documents or max index size as a variable 'n'. I'd write a scheduler that checks whether the threshold 'n' has been exceeded. If it has, I'd want to delete the oldest 'x' documents (based on time).
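To make that concrete, here's a rough sketch of the kind of scheduled job I have in mind (the index name, timestamp field, and thresholds are all placeholders, and I'm not sure yet whether this is even the right approach):

```python
# Rough sketch only: index name, timestamp field, and thresholds are made up.
# Uses the official Python client (pip install "elasticsearch>=5,<6" for a 5.6 cluster).
from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200

MAX_DOCS = 10000000            # the threshold 'n'
DELETE_OLDER_THAN = "now-7d"   # how far back the oldest 'x' documents reach

def cleanup(index="myindex"):
    doc_count = es.count(index=index)["count"]
    if doc_count <= MAX_DOCS:
        return
    # Delete the oldest documents by timestamp.
    es.delete_by_query(
        index=index,
        body={"query": {"range": {"@timestamp": {"lt": DELETE_OLDER_THAN}}}},
    )
```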

I have a couple of questions here:

Obviously, I don't want to delete too much or too little. How would I know what 'x' is? Can I simply tell Elasticsearch, "Hey, delete the oldest documents worth 5GB"? Is this possible? My intent is simply to free up a fixed amount of storage.

Secondly, I'd like to know what the best practice is here. Obviously I don't want to reinvent the wheel, and if there's anything that already does the job (e.g. Curator, which I've only heard about recently), I'd be happy to use it.

If your index is always growing, then deleting documents is not a best practice. It sounds like you have time-series data. If so, then what you want is time-series indices, or better yet, rollover indices.
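For reference, here's a minimal sketch of the rollover flow with the Python client (the alias name and conditions are just examples; on 5.6 the supported conditions are max_age and max_docs, with a size-based condition arriving in later versions):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# One-time setup: a concrete index plus a write alias that rollover will manage.
es.indices.create(index="logs-000001", body={"aliases": {"logs_write": {}}})

# Run periodically (e.g. from cron): if either condition is met, a new index
# (logs-000002, ...) is created and the write alias is switched over to it.
es.indices.rollover(
    alias="logs_write",
    body={"conditions": {"max_age": "1d", "max_docs": 10000000}},
)
```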

5GB is also a rather small amount to be purging, as a single Elasticsearch shard can healthily grow to 20GB - 50GB in size. Are you storage constrained? How many nodes do you have?

Thank you for the reply. Yes, we're storage constrained; 5GB is just an example. We have 3 nodes, and in a week's time we'd have hit roughly 50 to 60 GB. We want an automated way of cleaning up the index (by space) so that the system is completely protected from running out of disk.

Curator can help with that, but only if you're deleting indices, not deleting documents from indices.

Can you give a little more information about why someone wouldn't want to delete documents as a means of restricting the size of an index?

Deleting documents kicks off a multi-stage process behind the scenes:

  1. The documents to be deleted by age first have to be found with a query.
  2. The selected document IDs are marked for deletion. They are not deleted immediately; they continue to consume space until a segment merge (merges tend to be fairly frequent, but still not immediate).
  3. During a segment merge (which happens on a regular basis, without needing to be triggered), the documents marked for deletion are not rewritten into the new segments, effectively deleting them. The segment merge process is usually quick, but if there is a high number of documents to delete, it will take longer than a few moments, and that will affect search and indexing speed while it's going on.

On the other side of the equation, deleting an entire index:

  1. The index, all associated segments, and the documents contained therein are deleted immediately.
  2. There is no step 2.

If you think of it in SQL terms, it's semi-analogous to a SQL DROP TABLE vs. a SQL DELETE statement. Deleting an index is like dropping a TABLE, while the document-delete approach is more like DELETE FROM table WHERE timestamp < XXXX, which triggers a lot of individual atomic operations. In SQL, a DROP TABLE is always going to be more performant than a bunch of atomic deletes. The same is true in Elasticsearch.
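In API terms, the two approaches look something like this (the index names and the query are only illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# The DROP TABLE equivalent: the whole index and its segments go away at once.
es.indices.delete(index="logs-2017.09.01")

# The DELETE ... WHERE equivalent: every matching document is found, marked as
# deleted, and the space is only reclaimed later during segment merges.
es.delete_by_query(
    index="logs-2017.09.02",
    body={"query": {"range": {"@timestamp": {"lt": "now-7d"}}}},
)
```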


Sure, my main constraint here is storage. I'd want to be able to say definitively, for example, "Free up 20GB of space", which would then delete the oldest documents adding up to 20GB. If I delete individual documents, I wouldn't know how many documents make up 20GB.

Makes sense, thanks for that. I was looking at the rollover pattern. I guess this approach could work for me (although it doesn't directly support a size condition), but that should be easy enough to work around.

Basically, I can:

  • Periodically partition the index based on age or number of documents, using the rollover API.

  • Periodically check the older rolled-over indices and delete them by age or size (which might result in some unintended documents being deleted, though); see the sketch below.
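Here's a rough sketch of that cleanup step, assuming rollover-style index names (logs-000001, logs-000002, ...) so that sorting by name gives oldest-first, and a hypothetical 20GB budget to free; the pattern and alias names are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

FREE_BYTES = 20 * 1024 ** 3  # e.g. "free up 20GB"

def free_space(pattern="logs-*", write_alias="logs_write"):
    # Never delete the index the write alias currently points at.
    current = set(es.indices.get_alias(name=write_alias).keys())

    # On-disk size per index, from the index stats API.
    stats = es.indices.stats(index=pattern, metric="store")["indices"]

    freed = 0
    # Rollover suffixes (-000001, -000002, ...) sort oldest-first by name.
    for name in sorted(stats):
        if freed >= FREE_BYTES:
            break
        if name in current:
            continue
        size = stats[name]["total"]["store"]["size_in_bytes"]
        es.indices.delete(index=name)
        freed += size
```

Deleting whole indices this way also sidesteps the "how many documents make up 20GB" question entirely: you only ever drop complete indices, so the freed amount is just the sum of their on-disk sizes.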

Question: are there any downsides to having too many indices (physical/concrete indices created by rollovers) behind one alias? I understand every index has a maintenance overhead. Is this recommended?

How many is too many depends on the amount of RAM you have on your data nodes.

For a node with a 30GB heap, you can safely have around 500 - 700 shards. This value scales down in proportion to the amount of heap you have.

You can address this by ensuring that you aren't using the default number of shards (5 primaries, 1 replica per primary) unless you have enough nodes to really spread that around. 1 primary + 1 replica will be just fine for the smaller indices it looks like you will be making.
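For example, an index template along these lines (the template name and pattern are placeholders) would give every new rolled-over index 1 primary and 1 replica:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Applied automatically to every index whose name matches the pattern,
# including the new indices the rollover API creates.
es.indices.put_template(
    name="logs",
    body={
        "template": "logs-*",  # 5.x key; 6.x+ renamed this to "index_patterns"
        "settings": {
            "number_of_shards": 1,
            "number_of_replicas": 1,
        },
    },
)
```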
