Elastic cluster running out of space, can't get cluster health to green

I have a 2-node Elasticsearch cluster. I noticed that I was not able to add more data to it, and it turned out that disk usage had crossed the flood-stage watermark on both machines.

To try and recover from this, I shut down the whole cluster and added another 100 GB to the master node (the first one). I also added a third node.

Now what's happened is that the second and third nodes have filled up, but the first node has plenty of free space. The servers are called elk1, elk2, and elk3. Here is the df output:

elk1:
/dev/mapper/elasticdata-elastic 295G 178G 102G 64% /elastic

elk2:
/dev/sdb1 196G 180G 5.9G 97% /elastic

elk3:
/dev/sdb1 196G 179G 7.7G 96% /elastic

This message keeps repeating in elk1's log:

[2023-01-25T05:17:29,174][WARN ][o.e.c.r.a.DiskThresholdMonitor] [elk1] high disk watermark [96%] exceeded on [aG0V04CyRmuDsI9J2nkppQ][elk2][/elastic/nodes/0] free: 5.8gb[3%], shards will be relocated away from this node; currently relocating away shards totalling [0] bytes; the node is expected to continue to exceed the high disk watermark when these relocations are complete

When I run the cluster allocation explain API:
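(For reference, this is the standard allocation explain request; the index, shard, and primary values match the output below:)

GET _cluster/allocation/explain
{
  "index": "logdata",
  "shard": 0,
  "primary": false
}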

{
  "index" : "logdata",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2023-01-24T17:26:58.800Z",
    "failed_allocation_attempts" : 2,
    "details" : "failed shard on node [aG0V04CyRmuDsI9J2nkppQ]: shard failure, reason [merge failed], failure NotSerializableExceptionWrapper[merge_exception: java.io.IOException: No space left on device]; nested: IOException[No space left on device]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "DvpTvBh3RCyZvT0zrAfweQ",
      "node_name" : "elk3",
      "transport_address" : "165.88.130.21:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8072323072",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=90%], using more disk space than the maximum allowed [90.0%], actual free: [3.8833546828500873%]"
        }
      ]
    },
    {
      "node_id" : "aG0V04CyRmuDsI9J2nkppQ",
      "node_name" : "elk2",
      "transport_address" : "165.88.130.31:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8144543744",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=90%], using more disk space than the maximum allowed [90.0%], actual free: [3.0080296262767656%]"
        }
      ]
    },
    {
      "node_id" : "g-HQSgp6TCevlymvuzMRAA",
      "node_name" : "elk1",
      "transport_address" : "165.88.130.104:9300",
      "node_attributes" : {
        "ml.machine_memory" : "8139481088",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "20"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[logdata][0], node[g-HQSgp6TCevlymvuzMRAA], [P], s[STARTED], a[id=j4C-sV38QCO6PVowk-Y0iw]]"
        }
      ]
    }
  ]
}

Index status:

health status index    uuid                   pri rep docs.count docs.deleted store.size pri.store.size
yellow open   logdata tJ9ZjFluTUWbLGGkAOiZqA   1   1  665303762     57066721    177.5gb        177.5gb

I'm not sure how to recover from this. I've been trying to delete documents that are no longer needed, but I can't delete the entire index.

There are similar topics here and here.
ILM (index lifecycle management) is what you need in the end.
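As a rough sketch, a minimal ILM policy that deletes indices once they are older than 30 days could look like this (the policy name and retention period are just examples, adjust to your needs):

PUT _ilm/policy/logdata-cleanup
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}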

You can try to delete some log files from disk or documents from Elasticsearch, but it's time-consuming and you need to be cautious. Adjust the query as you wish and test it first on an unimportant or new test index.

POST indexname/_delete_by_query
{
  "query": {
    "range": {
      "@timestamp": {
        "lte": "2023-01-01T00:00:00.000Z",
        "format": "strict_date_optional_time"
      }
    }
  }
}
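Note: since your nodes went over the flood-stage watermark, Elasticsearch may have put an index.blocks.read_only_allow_delete block on the indices. Recent versions (7.4+) remove it automatically once disk usage drops below the high watermark, but on older versions you can clear it manually after freeing space:

PUT _all/_settings
{
  "index.blocks.read_only_allow_delete": null
}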

I managed to solve the issue by first setting the number of replicas for my index to 0, which caused the cluster to go green.
Then I deleted all the unwanted documents. Finally, I set the replicas back to 1. All is fine now.
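For anyone following along, those two steps use the standard index settings API (replace logdata with your own index name):

PUT logdata/_settings
{
  "index.number_of_replicas": 0
}

and, after the cleanup:

PUT logdata/_settings
{
  "index.number_of_replicas": 1
}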

This got me thinking: I run an extremely simple setup. Besides creating replicas, what other purpose can having an additional node serve? Does it give me any performance benefits? Or would it be okay to just run a single node?

Well done. I hope you will set up ILM.
On a single node you cannot have replicas.
A 3-node cluster guarantees availability and should be a bit faster, since shards are allocated across different nodes. (Disclaimer: I haven't tested which configuration is faster.)

With a multi-node cluster, your search times should improve while indexing is in progress. It also gives the cluster more computational power overall, keeping shards available for different workloads like dashboards, visualizations, watchers, and searches.


Ayush, are there perhaps any performance metrics comparing single-node, 3-node, and 5-10-node clusters on a 100-500 GB index pattern?

I believe you can rely on esrally, provided by Elastic, for benchmarking clusters. You can generate different types of reports to get better insight into your cluster sizing.