Threshold selection - How to define it?

Hi,

I have 4 data nodes with 25 TB each, for a total of 100 TB of storage.

I want to change the cluster disk thresholds because, by default, cluster.routing.allocation.disk.watermark.flood_stage is set to 95%.

That means I lose 1,250 GB of usable space per node, so 5 TB in total :confused:

I would like to change it, but I don't know the best practice for choosing an optimal value.
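For reference, I'm checking the current defaults and the per-node disk usage with something like this (Kibana Dev Tools syntax; the filter_path just narrows the output to the disk allocation settings):

```
GET _cluster/settings?include_defaults=true&filter_path=defaults.cluster.routing.allocation.disk
GET _cat/allocation?v
```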

I have two ideas:

  1. Based on my biggest index size (100 GB of primary data):
cluster.routing.allocation.disk.watermark.low: 200gb
cluster.routing.allocation.disk.watermark.high: 150gb
cluster.routing.allocation.disk.watermark.flood_stage: 120gb

Total lost space: 120 GB × 4 = 480 GB

  2. Based on the fact that my machines use LVM and the risk of a complete crash is quite limited:
cluster.routing.allocation.disk.watermark.low: 50gb
cluster.routing.allocation.disk.watermark.high: 20gb
cluster.routing.allocation.disk.watermark.flood_stage: 15gb

Total lost space: 15 GB × 4 = 60 GB
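If I go with option 2, I would apply it roughly like this (just a sketch; as far as I know the three watermarks must be either all byte values or all percentages, so I set all three together):

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "50gb",
    "cluster.routing.allocation.disk.watermark.high": "20gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "15gb"
  }
}
```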

Is my logic correct?

Best regards,

Thomas R.

Do you use SSDs? If not, I don't know why you would risk data loss for the sake of 5 TB. That 5 TB is only 5% of the price you paid for all that storage. Just invest another $150 and you are fine. But that's just my opinion; maybe there is a good reason for your question.

We will reach our limit in a few weeks, and since I don't have more storage for now, I'm trying to save some space this way.
Also, I will need to buy another server to increase my storage capacity. That implies extra costs: hardware, licences, etc.
It seems a shame to lose over 1 TB per node, but maybe I'm wrong.

If so, your settings should be okay, but bear in mind the risk that comes with them. Also, if you only have a few weeks left, that 4.4% of extra space will only buy you a couple of days, so the investment has to be made either way. Default Elasticsearch limits are there for a reason, but of course you can change them to your liking. The only problem I see is that with disks 99.4% full you will have no time to plan your next steps. If you have already planned what you will do when your disks reach that limit, go for it ^^

Yeah, I totally agree with you. We will have to invest either way.

Yes, I think so too, but I don't really understand the reason. For cluster/shard sizing, the common answer is "it depends".

Why not the same answer for this subject too?

For example, taking into account:

  • Number of documents / amount of data ingested per day
  • Size of the biggest index

If you lose 5 GB out of 100 GB of storage it doesn't matter, but when you work with petabytes of data it starts to get annoying.

There's surely a reason for this, but I don't really understand the logic behind it ^^'

About further steps: instead of deleting indices, what kind of actions do you think we could take? I'm thinking about these, but I'm not sure:

  • Reindex
  • Remove replicas
  • force_merge
  • Snapshot and delete

Maybe I'm taking my thinking too far :slight_smile:

"It depends" is the standard answer of every Elastic team member :wink:. I think they train it on their first day when they join the company. But it's just hard to create general rules that work both for people using 1 GB of data and for others using petabytes and supercomputers (who knows what Elastic is used for). I think the logic is that if you run into these limits, you still have time to take steps against them. Even if a company doesn't check its cluster state and suddenly realises it's close to full, it can still raise the thresholds and order new servers.

Sounds good to me. Removing replicas is the first thing you should try. Maybe force merges will help, but they are expensive to run and I don't think they save that much. Snapshots are an idea, but you will need a lot of storage for them.
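For example, something along these lines (the index and repository names are just placeholders, and the snapshot repository has to be registered beforehand):

```
# Drop replicas on indices you can afford to rebuild
PUT my-index-2021.01/_settings
{
  "index": { "number_of_replicas": 0 }
}

# Force merge an index you no longer write to, to expunge deleted docs
POST my-index-2021.01/_forcemerge?max_num_segments=1

# Snapshot an old index, then delete it once the snapshot succeeds
PUT _snapshot/my_backup_repo/snapshot-2021.01?wait_for_completion=true
{
  "indices": "my-index-2021.01"
}
DELETE my-index-2021.01
```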

Haha, that's exactly it :smiley:

Yes, maybe the best thing is to know your own "SLA" and configure the thresholds according to the actions you are able to take.

I now have a better idea of how I will work on it. Thanks :slight_smile:
