Data storage strategy

Hi,

I already have a couple of indices that contain a large amount of data.
I wonder if using snapshots or best_compression are good ways to help reduce the size of the existing indices.

I have looked at the documentation, which states: "Snapshots are automatically deduplicated to save storage space and reduce network transfer costs. To back up an index, a snapshot makes a copy of the index’s segments and stores them in the snapshot repository."

I wonder if the way it helps to save disk storage is to take a snapshot of an index and then remove that index, so that whenever we need that index we can restore it from the snapshot.

As for best_compression, I wonder if it only applies to documents indexed into an index in the future, so that stored documents are compressed to the smallest possible size.

I was about to answer your other question related to disk space in general but you removed it :frowning:

You seem to be looking for ideas to reduce the resources needed for your cluster, so I might not answer your questions directly but rather give you some ideas.

Hot, Warm, Cold, Frozen, Delete:

  • Hot: The index is actively being updated and queried.
  • Warm: The index is no longer being updated but is still being queried.
  • Cold: The index is no longer being updated and is queried infrequently. The information still needs to be searchable, but it’s okay if those queries are slower.
  • Frozen: The index is no longer being updated and is queried rarely. The information still needs to be searchable, but it’s okay if those queries are extremely slow.
  • Delete: The index is no longer needed and can safely be removed.

Depending on your use case, you can think about using those phases to perform some changes to reduce the disk space. For example, for logs, I'd probably not reduce anything in the hot phase but probably in the warm or cold phases.

If you are snapshotting an index to S3 or similar, and your index is no longer updated (again, time series data), you can move it to the Frozen phase (which needs a commercial license) or to the Delete phase. That way, you no longer consume costly resources like SSD space, because the data is offloaded to S3, which is way cheaper... With Frozen (and the searchable snapshots feature), you can still search the data held in S3. With Delete, you first need to restore the index (and consume disk space again) before you can search it.
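To make that a bit more concrete, here is a minimal sketch of what such an ILM policy body could look like, written as a Python dict that you would send with PUT _ilm/policy/<name>. The repository name my_s3_repo and all the ages and sizes are hypothetical placeholders, and the function only builds the JSON; it does not talk to a cluster.

```python
import json

def build_ilm_policy(repo, frozen_after="30d", delete_after="365d"):
    """Build the JSON body for PUT _ilm/policy/<name> (ages are made-up examples)."""
    return {
        "policy": {
            "phases": {
                # Hot: keep writing, roll over to a new index regularly.
                "hot": {
                    "actions": {
                        "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                    }
                },
                # Frozen: keep the data searchable from the snapshot repository
                # (searchable snapshots need a commercial license).
                "frozen": {
                    "min_age": frozen_after,
                    "actions": {"searchable_snapshot": {"snapshot_repository": repo}},
                },
                # Delete: remove the index once it is no longer needed.
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }

print(json.dumps(build_ilm_policy("my_s3_repo"), indent=2))
```

If you don't have the license for the Frozen phase, you would simply drop that phase and rely on your regular snapshots before the Delete phase kicks in.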

If you no longer need the original data but only the aggregated view, you can downsample your index.

Aggregates a time series (TSDS) index and stores pre-computed statistical summaries (min, max, sum, value_count and avg) for each metric field grouped by a configured time interval. For example, a TSDS index that contains metrics sampled every 10 seconds can be downsampled to an hourly index. All documents within an hour interval are summarized and stored as a single document in the downsample index.
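To give a feeling for the numbers in that example, here is a small sketch. The index names are hypothetical, the request is only built and not sent, and it assumes the source is a TSDS index that has already been made read-only.

```python
def docs_per_bucket(sample_interval_s, bucket_interval_s):
    """How many raw documents per time series collapse into one downsampled document."""
    return bucket_interval_s // sample_interval_s

def downsample_request(source, target, fixed_interval="1h"):
    """Return (method, path, body) for the downsample API call."""
    return ("POST", f"/{source}/_downsample/{target}", {"fixed_interval": fixed_interval})

# Metrics sampled every 10 seconds, summarized per hour:
print(docs_per_bucket(10, 3600))  # 360 raw documents become 1 per time series
print(downsample_request("metrics-raw", "metrics-hourly"))
```

So in that case each time series keeps roughly 1 document out of every 360, at the price of losing the individual samples.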

In the Warm or Cold phase, you can set the number of replicas to 0, assuming that you have backups in case of any problem...
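As a rough sketch of what that means: every replica is a full copy of the primaries, so going from 1 replica to 0 roughly halves the disk used by that index. The index name and sizes below are made up for illustration, and the request is only built, not sent.

```python
def drop_replicas_request(index):
    """Return (method, path, body) for setting the replica count to 0."""
    return ("PUT", f"/{index}/_settings", {"index": {"number_of_replicas": 0}})

def size_without_replicas_gb(total_size_gb, replicas):
    """Total size = primaries * (1 + replicas), so primaries = total / (1 + replicas)."""
    return total_size_gb / (1 + replicas)

print(drop_replicas_request("logs-2023.10"))
print(size_without_replicas_gb(300, 1))  # 150.0 GB left once the single replica is gone
```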

If you have many primary shards in the Hot phase, you can also use the Shrink API to reduce the number of shards to 1 and then do a force merge. That also helps to reduce the disk space. If you set index.codec to best_compression before running the force merge, that might help too (see Index modules | Elasticsearch Guide [8.11] | Elastic). Please be aware that all of this consumes a lot of IO because all the segments need to be rewritten...
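The order of operations could look like the sketch below. It only builds the three requests as (method, path, body) tuples; the index names and the node name are hypothetical, and you would send them with whatever client you use. It assumes the source index is no longer written to, since the shrink needs it to be read-only with a copy of every shard on one node.

```python
def shrink_and_compress_requests(source, target, node_name):
    """Requests for: prepare source -> shrink to 1 shard with best_compression -> force merge."""
    return [
        # 1. Move a copy of every shard to one node and block writes.
        ("PUT", f"/{source}/_settings", {"settings": {
            "index.routing.allocation.require._name": node_name,
            "index.blocks.write": True,
        }}),
        # 2. Shrink into a new single-shard index that uses best_compression,
        #    clearing the allocation requirement and write block on the target.
        ("POST", f"/{source}/_shrink/{target}", {"settings": {
            "index.number_of_shards": 1,
            "index.codec": "best_compression",
            "index.routing.allocation.require._name": None,
            "index.blocks.write": None,
        }}),
        # 3. Rewrite everything into one segment; this is where the codec takes effect.
        ("POST", f"/{target}/_forcemerge?max_num_segments=1", None),
    ]

for request in shrink_and_compress_requests("logs-2023.10", "logs-2023.10-shrunk", "node-1"):
    print(request)
```

Note that the shrink leaves the source index in place, so the disk space is only freed once you have checked the target and deleted the source.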

Those are some ideas which I hope will help you to find what is best for you...

Hi @dadoonet ,

Thanks for the detailed reply!

I have looked at the documentation about ILM, and I am a bit confused about the Delete phase. If the Delete phase does not really delete the data and it can later be restored to disk, where does the deleted data go?

I also wonder if it is OK to shrink, force merge, and set index.codec to best_compression on the indices generated by APM Server.

Delete removes the index. If you have snapshots, snapshots are not deleted by ILM.
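In other words, the data only lives on in whatever snapshot you took before the index was deleted. A minimal sketch of that snapshot, delete, restore cycle is shown here as plain (method, path, body) tuples. The repository, snapshot and index names are hypothetical, the repository is assumed to be registered already, and nothing is sent to a cluster.

```python
def archive_and_restore_requests(repo, snapshot, index):
    """Requests for: snapshot one index -> delete it -> restore it later."""
    return [
        # 1. Snapshot only this index and wait until the snapshot has finished.
        ("PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true",
         {"indices": index, "include_global_state": False}),
        # 2. Delete the index to free local disk (only once step 1 has succeeded).
        ("DELETE", f"/{index}", None),
        # 3. Later, when the data is needed again: restore it from the snapshot.
        ("POST", f"/_snapshot/{repo}/{snapshot}/_restore",
         {"indices": index, "include_global_state": False}),
    ]

for request in archive_and_restore_requests("my_s3_repo", "logs-2023.10-snap", "logs-2023.10"):
    print(request)
```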
