Threshold selection - How to define it?

Hi,

I have 4 data nodes with 25 TB each, for a total of 100 TB of storage.

I want to change the cluster threshold because, by default, the option

cluster.routing.allocation.disk.watermark.flood_stage is at 95%.

That means I lose 1,250 GB of usable space per node, so 5 TB in total :confused:
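For reference, you can confirm the current (default) watermark values with the cluster settings API, since the defaults don't appear in a plain `GET _cluster/settings` response:

```
GET _cluster/settings?include_defaults=true
```

Then look for the `cluster.routing.allocation.disk.watermark.*` keys under `defaults`.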

I would like to change it, but I don't know the best practice for choosing an optimized value.

I have 2 ideas:

  1. Based on my biggest index size (100 GB on the primary index):
cluster.routing.allocation.disk.watermark.low: 200gb
cluster.routing.allocation.disk.watermark.high: 150gb
cluster.routing.allocation.disk.watermark.flood_stage: 120gb

Total lost space: 120 × 4 = 480 GB

  2. Based on the fact that my machines use LVM, so the risk of a complete crash is quite limited:
cluster.routing.allocation.disk.watermark.low: 50gb
cluster.routing.allocation.disk.watermark.high: 20gb
cluster.routing.allocation.disk.watermark.flood_stage: 15gb

Total lost space: 15 × 4 = 60 GB
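Either option can be applied dynamically with the cluster settings API (shown here with option 2's values). Note that when you use absolute byte values, the watermarks mean *free space remaining*, so `low` must be the largest value and `flood_stage` the smallest, as above:

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "50gb",
    "cluster.routing.allocation.disk.watermark.high": "20gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "15gb"
  }
}
```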

Is my logic correct ?

Best regards,

Thomas R.

Do you use SSDs? Because if not, I don't know why you want to risk data loss just because of 5 TB. 5 TB is only 5% of the price you paid for all that storage. Just invest another $150 and you are fine. But that's just my opinion. Maybe there is a good reason for your question.

We will reach our limit in a few weeks, and as I don't have more storage for now, I'm trying to save some space this way.
Also, I will need to buy another server to increase my storage capacity. That implies extra cost: hardware, licences, etc.
It's a shame to lose 1 TB per node, I think, but maybe I'm wrong.

If so, your settings should be okay, but bear in mind the risk that comes with them. Also, if you only have a few weeks left, that 4.4% of extra space will give you only a couple of days more, so your investment has to be made either way. Default Elasticsearch limits are there for a reason, but of course you can change them to your liking. The only problem I see is that if your disks are 99.4% full, you will have no time to plan your further steps. If you have already planned what you will do when your disks reach that limit, go for it ^^

Yeah, I totally agree with you. We will have to invest either way.

Yes, I think so too, but I don't really understand the reason. For cluster/shard sizing, the common answer is "it depends".

Why not the same answer also for this subject ?

For example, taking into account:

  • Number of documents/amount of data ingested per day
  • Size of the biggest index

If you lose 5 GB on 100 GB of storage it doesn't matter, but when you work with petabytes of data it starts to be annoying.

There's surely a reason for this, but I don't really understand the logic behind it ^^'

About further steps: instead of deleting indexes, what kind of steps do you think we can take? I'm thinking about these actions, but I'm not sure:

  • Reindex
  • Remove replicas
  • force_merge
  • Snapshot and delete
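Concretely, the replica-removal and force-merge steps could look like this (`my-index` is a placeholder name):

```
# Drop replicas for an index — frees roughly half its footprint if you had 1 replica
PUT my-index/_settings
{
  "index": { "number_of_replicas": 0 }
}

# Merge segments to reclaim space from deleted documents
# (best done on indices that are no longer being written to)
POST my-index/_forcemerge?max_num_segments=1
```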

Maybe I'm taking my thinking too far :slight_smile:

is the standard answer of every Elastic team member :wink:. I think they train it on their first day when they join the company. But I think it's just hard to create general rules that work both for people using 1 GB of data and for others using petabytes and supercomputers (who knows what Elasticsearch is used for). I think the logic is that if you run into these limits, you still have time to take steps against it. Maybe if a company doesn't check their cluster state and then suddenly realises it's close to full, they can still raise this limit and order new servers.

Sounds good to me. Removing replicas is the first thing you should try. Maybe force merges will help, but they will lead to longer query times and I don't think they help that much. Snapshots are an idea, but you will need a lot of storage for them.

Haha, that's completely it :smiley:

Yes, maybe the best thing is to know your own "SLA" and configure it according to the actions you are able to take.

I now have a better idea of how I will work on it. Thanks :slight_smile: