Elasticsearch cluster stops ingesting when one node reaches disk capacity: possible resolutions

I have a 2-node cluster in which one node has reached its maximum disk capacity (96% used). The two nodes have different disk capacities. I get the following error:

TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block

My current understanding is that if any node reaches the flood-stage watermark, the entire cluster (all nodes included) stops ingesting new data.

So what are the possible solutions other than deleting existing data? Can I create a new data mount point and add it to the node's elasticsearch.yml? Would that resolve the problem? Or is there another solution?

Currently my replication factor is 1, with 2 nodes in the cluster.

Any help here would be appreciated.

Not exactly. When a node reaches the flood-stage watermark, Elasticsearch marks every index that has at least one shard on that node as read-only.

But in your case, since you have just 2 nodes and every index has a replica, each index has a shard copy on the full node, which means that all your indices will be marked as read-only.
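To make the rule concrete, here is a toy model in Python. It is not Elasticsearch code: the node names, index names, and the `blocked_indices` helper are made up for illustration, and the 95% threshold is the default flood-stage watermark, which may be configured differently on your cluster.

```python
FLOOD_STAGE_PERCENT = 95.0  # default flood-stage watermark; assumed unchanged


def blocked_indices(disk_used_percent, shard_locations,
                    flood_stage=FLOOD_STAGE_PERCENT):
    """Return the indices that would get the read-only-allow-delete block.

    disk_used_percent: node name -> percentage of disk used
    shard_locations:   index name -> nodes holding at least one of its shards
    """
    # Nodes whose disk usage is above the flood-stage watermark.
    full_nodes = {node for node, used in disk_used_percent.items()
                  if used > flood_stage}
    # An index is blocked if any of its shards sits on a full node.
    return {index for index, nodes in shard_locations.items()
            if full_nodes & set(nodes)}


# Hypothetical 2-node cluster where node-1 is at 96%.
disk = {"node-1": 96.0, "node-2": 60.0}

# With one replica, every index has a shard copy on both nodes.
with_replicas = {"logs-a": ["node-1", "node-2"],
                 "logs-b": ["node-2", "node-1"]}

# Without replicas, each index lives on a single node.
without_replicas = {"logs-a": ["node-1"], "logs-b": ["node-2"]}

print(sorted(blocked_indices(disk, with_replicas)))     # ['logs-a', 'logs-b']
print(sorted(blocked_indices(disk, without_replicas)))  # ['logs-a']
```

With replicas, both indices touch the full node and both are blocked. Without replicas, only the index stored on the full node is affected.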

I would say that in your case the easiest solution is to remove all replicas. A 2-node cluster does not have any resilience, so having replicas does not make much difference.

You can remove the replicas using the following request:

PUT /*/_settings
{
    "index" : {
        "number_of_replicas" : 0
    }
}
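If you would rather send this from a script than from the console, a minimal Python sketch using only the standard library could look like the one below. The base URL `http://localhost:9200` and the helper name `replica_settings_request` are assumptions for illustration; a secured cluster would also need authentication and TLS settings that are not shown here.

```python
import json
import urllib.request


def replica_settings_request(base_url, replicas=0, index_pattern="*"):
    """Build (but do not send) the PUT /<pattern>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}})
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed address of one of the nodes; adjust for your cluster.
req = replica_settings_request("http://localhost:9200")
print(req.get_method(), req.full_url)

# To actually apply the change, uncomment the next line:
# urllib.request.urlopen(req)
```

Building the request separately from sending it lets you inspect the URL and body before touching the cluster.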

Thanks for replying, @leandrojmp. Could you please explain why a 2-node cluster is not resilient? If one node goes down, will the second not be able to support the cluster?

I am just trying to understand things a little better from my perspective.

No. Elasticsearch needs a quorum to elect a master node, and with just 1 node up you may not have that quorum.

You end up with two scenarios:

  • If the node that goes down is not the current master, your cluster will still work because the master is up.
  • If the node that goes down is the current master, your cluster will not work until the node gets back online.
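The two scenarios come down to a majority vote. Here is a simplified Python sketch of that idea. It is not how Elasticsearch implements elections, and `has_quorum` and the node names are made up. It assumes, as I understand the behaviour in recent versions, that with an even number of master-eligible nodes Elasticsearch leaves one of them out of the voting configuration, so in a 2-node cluster only one node's vote counts.

```python
def has_quorum(voting_nodes, alive_nodes):
    """True if a strict majority of the voting nodes is still reachable."""
    voting = set(voting_nodes)
    votes = len(voting & set(alive_nodes))
    return votes > len(voting) // 2


# Hypothetical 2-node cluster: only node-1 is in the voting configuration.
two_node_voting = {"node-1"}
print(has_quorum(two_node_voting, {"node-1"}))  # node-2 is down -> True
print(has_quorum(two_node_voting, {"node-2"}))  # node-1 is down -> False

# Hypothetical 3-node cluster: any single node can fail.
three_node_voting = {"node-1", "node-2", "node-3"}
print(has_quorum(three_node_voting, {"node-2", "node-3"}))  # True
```

With three master-eligible nodes, losing any one of them still leaves a majority, which is why three nodes is the usual minimum for a resilient cluster.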

You can read more here.

Thank you!! This helps. So if I change the replication factor to 0, what will happen to the current data? And what about future data that gets ingested on the nodes?