ES FileSystemException (out of disk space)

My entire cluster that back-ends my graylog deployment stopped working over the weekend.

It's complaining that /proj/graylog (the directory where I store ES data) is full, but a df shows that it's only 50% full:

Filesystem              Type Size Used Avail Use% Mounted on
server:/graylog/graylog/ nfs 238G 117G 121G 50% /proj/graylog
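It can also help to compare df against what Elasticsearch itself thinks is free, since ES's disk-based allocation decisions use its own filesystem stats. A sketch of the checks, assuming the default localhost:9200 endpoint (adjust host/port for your nodes):

```shell
# Disk usage per node as Elasticsearch sees it (disk.used / disk.avail columns)
curl -XGET 'http://localhost:9200/_cat/allocation?v'

# Raw filesystem stats per node, including total/free/available bytes per data path
curl -XGET 'http://localhost:9200/_nodes/stats/fs?pretty'
```

If these disagree with df, something outside ES (snapshots, reservations, quotas) may be eating space that df doesn't attribute to the mount.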

curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'
{
  "index" : "in_graylog_18",
  "shard" : 1,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2017-10-23T09:52:46.654Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed to create shard, failure FileSystemException[/proj/graylog/grayloges3/elasticsearch/graylog/nodes/0/indices/tkISaqIcQW21ZnwZrEj1kA/1/_state/state-1.st.tmp: No space left on device]",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "kZnka3m4Tl-wwzG-MPBHOw",
      "node_name" : "grayloges3",
      "transport_address" : "10.66.8.202:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "TkTj0jxBRD68v8jQdGmB5A"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2017-10-23T09:52:46.654Z], failed_attempts[5], delayed=false, details[failed to create shard, failure FileSystemException[/proj/graylog/grayloges3/elasticsearch/graylog/nodes/0/indices/tkISaqIcQW21ZnwZrEj1kA/1/_state/state-1.st.tmp: No space left on device]], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}
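The explain API only reports one unassigned shard at a time. To get the full picture of how many shards are stuck, a quick listing (again assuming localhost:9200):

```shell
# List every shard with its state; filter to the ones that failed to allocate
curl -XGET 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED
```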

I tried restarting ES on each node, but I have been unable to get it working again.

What process do I follow to get the cluster working again without losing any data?

It's a three-node cluster running elasticsearch-5.5.2.

I just found out from my storage admin this morning that the disk actually was full (.snapshots was the culprit).

The disk is no longer full so how do I get the cluster to recover?

OK, I figured out how to recover the cluster... the magical command was: curl -XPOST 'localhost:9200/_cluster/reroute?retry_failed'
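For anyone else hitting this: the allocation explain output above actually names the fix in the max_retry decider message. Once the disk has space again, a sketch of the recovery sequence (documented form of the flag is retry_failed=true, though the bare flag works too):

```shell
# Reset the failed-allocation counter and retry assigning the stuck shards
curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true'

# Watch the cluster go yellow -> green as shards are reassigned
curl -XGET 'http://localhost:9200/_cat/health?v'
```

The retry is needed because ES gives up after 5 failed allocation attempts and won't try again on its own, even after the underlying disk problem is fixed.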
