Data compression and retention

Hi,
I have a task in which I need to store data in an Elastic cluster. Altogether I have 12 TB of disk space available. These are the requirements:

  1. 90 days of data retention
  2. One replica shard per one primary shard
  3. 100GB of data per day

How should I set up the cluster to make this possible? How would you adjust the compression algorithms, and is it possible to set the highest-compression algorithm for replica shards and the default one for primary shards? How should I adjust my ILM policies? Should I change the compression rate for different phases? Is there anything else I should consider? What are your suggestions? Anything helps; this is a task that needs to be done. I also need to take performance into consideration, so it must remain usable.

Thank you for any tips in advance

Is the 12 TB only for the data nodes, or is it the total disk space of your machines, which will also be used by the operating system installation etc.?

Just taking your requirements into consideration, you would need 18 TB of disk space (100 GB/day × 90 days × 2 copies with one replica), so you are at least 6 TB short. Keep in mind that you cannot use 100% of the disk of a data node.

By default, once a node reaches 85% disk usage (the low disk watermark), Elasticsearch stops allocating shards to that node. This can be changed to make better use of the space, but you will still need a margin of unused space on each node.

You can't have different compression for replicas and primaries; the compression is set at the index level, and I'm not sure it would make that much difference anyway.
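For reference, compression is controlled by the `index.codec` setting, which can only be set at index creation time (or on a closed index) and applies to primaries and replicas alike, since a replica is a full copy of the primary. A minimal sketch, with a hypothetical index name:

```
PUT my-logs-000001
{
  "settings": {
    "index.codec": "best_compression"
  }
}
```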

To give you more suggestions, we would need more context about your cluster: how many nodes you have, the specs of each node, etc. Or do you not have a working cluster yet?

Hi, thank you for the quick response.

I have 4 all-in-one nodes, each with 3 TB of disk space for data storage. This is something that I cannot change, and my task is to optimize it to work with 100 GB per day for at least 90 days. I am aware that these are stupid requirements, but this is how it is. What could I do to pack it all into 12 TB?

With a 3 TB disk on each node, by default you would be able to use 85% of that to store your data, which gives you roughly 2.5 TB per node, or about 10 TB of usable space in total.

So, the first thing you should do is change the watermark levels used by your cluster to increase the amount of usable disk space.
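The disk watermarks are dynamic cluster settings, so as a sketch (the exact percentages are up to you and depend on how much headroom you can afford), you could raise them like this:

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}
```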

To make things easier, let's assume that you can use the full 3 TB of each disk.

Even using the entire 12 TB, you still wouldn't be able to store 90 days of data with replicas at 100 GB per day; you would need at least 18 TB.

You would then need to reduce the size of your indices, and there are a couple of ways to do that. The first one, which I would consider mandatory, is to check your mappings.

If you are using dynamic mappings in your indices, you are probably wasting space, because string fields are mapped twice, as both text and keyword. You should check your data and map each field according to how it is actually used; this can help you reduce the index size.
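As an illustration (the index and field names here are hypothetical), an explicit mapping that stores a field only one way could look like this:

```
PUT logs-example
{
  "mappings": {
    "properties": {
      "@timestamp": { "type": "date" },
      "host":       { "type": "keyword" },
      "message":    { "type": "text" }
    }
  }
}
```

With dynamic mapping, `host` would get both a text field and a keyword sub-field; mapping it once as keyword stores it only one way.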

Also, you didn't provide any information on how you are indexing your data, but another thing that helps is whether or not you keep the original message after parsing it. If you are storing the original message, I would suggest removing it after parsing; this can also reduce the size of your index, by a huge percentage in most cases.
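For example, assuming you ingest through an ingest pipeline and the raw line lives in a `message` field (both assumptions on my part, since you haven't described your ingestion), you could drop it right after parsing with a `remove` processor:

```
PUT _ingest/pipeline/parse-and-drop
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IPORHOST:client} %{WORD:method} %{URIPATHPARAM:path}"]
      }
    },
    {
      "remove": {
        "field": "message",
        "ignore_missing": true
      }
    }
  ]
}
```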

Another option would be to change the compression of an index after some time, for example using ILM to switch to higher compression after 30 days.

I'm not sure it would help much, as this depends a lot on your data and the default compression is already pretty good, but you will need to try it. Just keep in mind that it will also have some impact on performance.
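If you want to try it, here is a sketch of such a policy (the policy name and the rollover thresholds are just placeholders): a hot phase with rollover, a warm phase at 30 days that force-merges with `best_compression`, and a delete phase at 90 days to cover your retention:

```
PUT _ilm/policy/compress-after-30d
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_primary_shard_size": "50gb"
          }
        }
      },
      "warm": {
        "min_age": "30d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1,
            "index_codec": "best_compression"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```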

There is no magic: with your requirements you need at least 18 TB of disk, which you do not have, so you will need to test these things and see whether they reduce the size of your indices. And all of this assumes that your daily data ingestion will not change.

