Elastic cluster running out of space, can't get cluster health to green

I have a 2-node Elasticsearch cluster. I noticed that I was not able to add more data to it, and it turned out that disk usage had crossed the flood-stage watermark on both machines.

To try and recover from this, I shut down the whole cluster and added another 100 GB to the master node (the first one). I also added a third node.

Now what's happened is that the second and third nodes have filled up, but the first node has plenty of free space. The servers are called elk1, elk2, and elk3. Here is the df output:

elk1:
/dev/mapper/elasticdata-elastic 295G 178G 102G 64% /elastic

elk2:
/dev/sdb1 196G 180G 5.9G 97% /elastic

elk3:
/dev/sdb1 196G 179G 7.7G 96% /elastic

This message keeps repeating in elk1's log:

[2023-01-25T05:17:29,174][WARN ][o.e.c.r.a.DiskThresholdMonitor] [elk1] high disk watermark [96%] exceeded on [aG0V04CyRmuDsI9J2nkppQ][elk2][/elastic/nodes/0] free: 5.8gb[3%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete

When I run the cluster allocation explain API:
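(For reference, this is the standard allocation explain request; the index, shard, and primary values match the output below:)

GET _cluster/allocation/explain
{
  "index": "logdata",
  "shard": 0,
  "primary": false
}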

{
  "index" : "logdata",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2023-01-24T17:26:58.800Z",
    "failed_allocation_attempts" : 2,
    "details" : "failed shard on node [aG0V04CyRmuDsI9J2nkppQ]: shard failure, reason [merge failed], failure NotSerializableExceptionWrapper[merge_exception: java.io.IOException: No space left on device]; nested: IOException[No space left on device]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "DvpTvBh3RCyZvT0zrAfweQ",
      "node_name" : "elk3",
      "transport_address" : "165.88.130.21:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8072323072",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=90%], using more disk space than the maximum allowed [90.0%], actual free: [3.8833546828500873%]"
        }
      ]
    },
    {
      "node_id" : "aG0V04CyRmuDsI9J2nkppQ",
      "node_name" : "elk2",
      "transport_address" : "165.88.130.31:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8144543744",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=90%], using more disk space than the maximum allowed [90.0%], actual free: [3.0080296262767656%]"
        }
      ]
    },
    {
      "node_id" : "g-HQSgp6TCevlymvuzMRAA",
      "node_name" : "elk1",
      "transport_address" : "165.88.130.104:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8139481088",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[logdata][0], node[g-HQSgp6TCevlymvuzMRAA], [P], s[STARTED], a[id=j4C-sV38QCO6PVowk-Y0iw]]"
        }
      ]
    }
  ]
}

Index status:

health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logdata tJ9ZjFluTUWbLGGkAOiZqA   1   1  665303762     57066721    177.5gb        177.5gb

I'm not sure how to recover from this. I've been trying to delete documents that are no longer needed, but I can't delete the entire index.

There are similar topics here and here.
ILM (index lifecycle management) is what you need in the end.
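As a rough sketch, a minimal ILM policy that deletes indices once they are older than 30 days could look like this (the policy name and retention period are just examples, adjust to your needs):

PUT _ilm/policy/logdata-cleanup
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}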

You can try to delete some log files from disk or documents from Elasticsearch, but it's time-consuming and you need to be cautious. Adjust the query as you wish and test it first on an unimportant or new test index.

POST indexname/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": {
        "lte": "2023-01-01T00:00:00.000Z",
        "format": "strict_date_optional_time"
      }
    }
  }
}
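Note: since your nodes went over the flood-stage watermark, Elasticsearch may have put an index.blocks.read_only_allow_delete block on the indices. Recent versions (7.4+) remove it automatically once disk usage drops below the high watermark, but on older versions you can clear it manually after freeing space:

PUT _all/_settings
{
  "index.blocks.read_only_allow_delete": null
}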

I managed to solve the issue by first setting the number of replicas for my index to 0, which caused the cluster to go green.
Then I deleted all the unwanted documents. Finally, I set the replicas back to 1. All is fine now.
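For anyone following along, those two steps use the standard index settings API (replace logdata with your own index name):

PUT logdata/_settings
{
  "index.number_of_replicas": 0
}

and, after the cleanup:

PUT logdata/_settings
{
  "index.number_of_replicas": 1
}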

This got me thinking: I run an extremely simple setup. Besides creating replicas, what other purpose can having an additional node serve? Does it give me any performance benefits? Or would it be okay to just run a single node?

Well done. I hope you will set up ILM.
On a single node you cannot have replicas.
A 3-node cluster guarantees availability and should be a bit faster, since shards are allocated across different nodes. (Disclaimer: I haven't tested which configuration is faster.)

With a multi-node cluster, your search times should improve while indexing is in progress. It also gives the cluster more computational power overall, keeping shards available for different workloads like dashboards, visualizations, watchers, and searches.


Ayush, are there perhaps any performance metrics comparing single-node, 3-node, and 5-10-node clusters on a 100-500 GB index pattern?

I believe you can rely on esrally, provided by Elastic, for benchmarking clusters. You can generate different types of reports to get better insight into your cluster sizing.