Documents are no longer saved after high disk watermark exceeded on an elasticsearch cluster

Hi everyone,

I have a question about how load balance work on an elasticsearch servers cluster.
The cluster is composed by :

  • 4 "master, data" nodes,
  • 2 "data" nodes
  • 1 "coordinator" node

The warning "high disk watermark [90%] exceeded" is displayed in the logstash log file, and documents are no longer saved in the elasticsearch index.
After checking all the elasticsearch nodes in the cluster, it appears that for 2 nodes, disks are used over 90%.

So my question are :

  1. why elasticsearch does not send documents to the available nodes ? or is it a normal behavior to be bloked if a node is down (i don't think so...) ?
  2. I'm not an expert but it seems like a sharding problem due to the cluster's settings. May be i missed a configuration. Right ?
  3. what are the steps to "re up" my cluster and documents saving

In the logstash configuration file, output to elastic is set as :
output { elasticsearch { hosts => ["host_1:port", "host_2:port", "host_3:port",...."host_n:port"] } ...}
and communication between Logstash and the elasticsearch cluster is through SSL protocol.

For information, here is the reuslt of GET /_cat/indices/?v=true :

health status index                              uuid                   pri rep docs.count docs.deleted store.size
green  open   idx-aggregated-logs-000001 qJJE4BUTQYyj9ExIUegJBQ          1   1  511638025            0    747.4gb        373.7gb

And the result of GET /_cat/shards

.ds-ilm-history-5-2022.08.31-000004                           0 p STARTED         
.ds-ilm-history-5-2022.08.31-000004                           0 r STARTED         
idx-aggregated-logs-000001                            		  0 r STARTED 511638025 373.6gb
idx-aggregated-logs-000001                            		  0 p STARTED 511638025 373.7gb
.ds-.logs-deprecation.elasticsearch-default-2022.08.10-000001 0 p STARTED         
.ds-.logs-deprecation.elasticsearch-default-2022.08.10-000001 0 r STARTED         
.ds-.logs-deprecation.elasticsearch-default-2022.08.24-000002 0 p STARTED         
.ds-.logs-deprecation.elasticsearch-default-2022.08.24-000002 0 r STARTED         
.ds-ilm-history-5-2022.06.02-000001                           0 p STARTED         
.ds-ilm-history-5-2022.06.02-000001                           0 r STARTED         
.ds-ilm-history-5-2022.08.01-000003                           0 p STARTED         
.ds-ilm-history-5-2022.08.01-000003                           0 r STARTED         
.ds-ilm-history-5-2022.07.02-000002                           0 p STARTED         
.ds-ilm-history-5-2022.07.02-000002                           0 r STARTED         

Thx for your help,

  1. It will depend on where the shards for the index that is being written to belong. If they exist on a node that has gone over the watermark then the entire index is made read only
  2. Why do you ask that?
  3. If you now have enough space on your nodes - Error: disk usage exceeded flood-stage watermark, index has read-only-allow-delete block | Elasticsearch Guide [8.4] | Elastic

It looks like you have created a rollover index but not set up ILM to actually perform the rollover based on size and/or age. You therefore have ended up with a single huge index that on its own has breached the watermark. As this has one replica configured you have two shards that sit on two of the data nodes. As Elasticsearch manages data at the shard level it can not move part of the data to any other nodes.

The only way I can see to get out of this is to either add disk space to the nodes and then split the index or create a new index with a working ILM policy with configured rollover and reindex the data from the current index into this. The new indices should go to the two nodes currently not holding much data and you should end up with multiple indices backed by shards of a reasonable size. Once the data has been reindexed you should delete the current index and Elasticsearch will then be able to distribute the data across all 4 data nodes as you will have a larger number of smaller shards.

Thank you Christian for your answer.

Effectively, an ILM policy was already configured for the index.

The result for GET _ilm/policy is :

"policy": {
            "phases": {
                "hot": {
                    "min_age": "0ms",
                    "actions": {
                        "rollover": {
                            "max_size": "50gb",
                            "max_age": "365d"

So I add this property :

"max_primary_shard_size": "50gb",

After reindexing, the result of GET _cat/shards/idx*?v=true is :

 		  index            	shard 	prirep 	  state        docs   	store 	ip             	node
idx-aggregated-logs-000002 	  0   	  p      STARTED   	  9861998   3.2gb
idx-aggregated-logs-000002 	  0       r      STARTED   	  9861998   3.4gb
idx-aggregated-logs-000001 	  0       r      STARTED 	  511638025 373.6gb
idx-aggregated-logs-000001 	  0       p      STARTED 	  511638025 373.7gb

Why the new created index store only 3.2gb ?
I expected that after reindexing, all the data would be split into multiple indexes on multiple nodes.
If i delete the current index, am i going to lose all data store in it ?


You have performed a rollover, not any reindexing. By fixing the ILM policy rollover has kicked in and a new backing index has been created. This is now accepting data and should rollover to a new index once it gets to 50GB in size. This is located on the nodes without storage issues, so your cluster should be able to accept data for a while now.

You have however not done anything with the old index, which is huge. You have 2 options. You could simply leave it there for a while and then delete it once the data is no longer needed. That would make you lose all that data. The other approach is to reindex the data into new indices using the reindex API. This will require additional space and may not be possible unless you drop the replica and free up space that way.

After updating the rollover by adding the max_primary_shard_size parameter, i executed this reindexing query :
POST _reindex?wait_for_completion=false

The wait_for_completion parameter is to avoid a timeout.
Body :

    "source": {
        "index": "idx-aggregated-logs-000001"
    "dest": {
        "index": "idx-aggregated-logs-000002"

When i check the shards information, i have :

index                        shard      prirep    state        docs          store
idx-aggregated-logs-000002   0          p         STARTED      9861998       2.8gb 
idx-aggregated-logs-000002   0          r         STARTED      9861998       3.7gb

I thought that the reindex was going to distribute data from index 1 to shards of index 2.

That is not good. Try to cancel that reindexing task.

A better approach may be to set up a new rollover index with a different name and a valid, correct ILM policy and reindex from the huge index into this. Do not index into the index name though - make sure you index into the write alias so rollover can create multiple backing indices of suitable size.

If you want to reindex the data in the large index you need to write to the write alias so you get multiple backing indices, not the underlying index name. This will not remove any data from the huge index, just add it to the new index.

Hi Christian,
I'm lost.... :cry:
You said above that fixing the ILM policy rollover will create a new index that is should rollover once it gets 500GB in size.
So, last monday we have fixed the ILM policy, and redeployed from scratch our ES cluster.
(12 data nodes, 10 master nodes and 2 clients nodes). Cluster health was green, logs are gathered and stored in ES.
Unfortunatelly, on friday we noticed that the used disk space was more than 90% (350gb) and ES was not load balancing docs to another available node.

We discussed setting the shard size threshold to 50GB, not 500GB.

The cluster you described had 2 data nodes, not 10. If you have 10 data nodes you may want to increase the number of primary shards for the index to e.g. 5 so there are enough shards to distribute across the cluster. If your index only has 1 primary shard (which is suitable when you have 2 data nodes) that and the replica will ever only occupy 2 nodes.

This makes no sense. You should aim to have 3 dedicated master nodes in the cluster.

thanks for your availability Christian.
Sorry, you are right, its not 500 but 50gb. it's just a typo :rofl:

The GET /_cluster/stats return :

"nodes": { "count": { "total": 14, "coordinating_only": 2, "data": 12, "data_cold": 0, "data_content": 0, "data_frozen": 0, "data_hot": 0, "data_warm": 0, "ingest": 0, "master": 10, "ml": 0, "remote_cluster_client": 0, "transform": 0, "voting_only": 0 } }

GET /_cluster/health query return :
{ "cluster_name": "cluster-PPR_HML", "status": "green", "timed_out": false, "number_of_nodes": 14, "number_of_data_nodes": 12, "active_primary_shards": 3, "active_shards": 6, "relocating_shards": 0, "initializing_shards": 0, "unassigned_shards": 0, "delayed_unassigned_shards": 0, "number_of_pending_tasks": 0, "number_of_in_flight_fetch": 0, "task_max_waiting_in_queue_millis": 0, "active_shards_percent_as_number": 100.0 }

And the GET /_nodes/stats return :
10 nodes with roles data, master
2 nodes with roles data
and 2 nodes with empty role (i assume these are the coordinating nodes)

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.