The entire Elasticsearch cluster that backs my Graylog deployment stopped working over the weekend.
It is complaining that /proj/graylog (the directory where I store ES data) is full, but df shows that it is only 50% full:
Filesystem                Type  Size  Used  Avail  Use%  Mounted on
server:/graylog/graylog/  nfs   238G  117G  121G   50%   /proj/graylog
Here is what the allocation explain API reports:
curl -XGET 'http://localhost:9200/_cluster/allocation/explain?pretty'
{
  "index" : "in_graylog_18",
  "shard" : 1,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2017-10-23T09:52:46.654Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed to create shard, failure FileSystemException[/proj/graylog/grayloges3/elasticsearch/graylog/nodes/0/indices/tkISaqIcQW21ZnwZrEj1kA/1/_state/state-1.st.tmp: No space left on device]",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "kZnka3m4Tl-wwzG-MPBHOw",
      "node_name" : "grayloges3",
      "transport_address" : "10.66.8.202:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "TkTj0jxBRD68v8jQdGmB5A"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2017-10-23T09:52:46.654Z], failed_attempts[5], delayed=false, details[failed to create shard, failure FileSystemException[/proj/graylog/grayloges3/elasticsearch/graylog/nodes/0/indices/tkISaqIcQW21ZnwZrEj1kA/1/_state/state-1.st.tmp: No space left on device]], allocation_status[deciders_no]]]"
        }
      ]
    }
  ]
}
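For reference, the manual retry that the max_retry decider asks for would look roughly like the sketch below. It assumes a node answers on localhost:9200, as in the curl call above; the ES_URL variable and the two function names are only illustrative and not part of Elasticsearch. The retry can only succeed once whatever causes "No space left on device" on the data path has been dealt with.

```shell
# Assumption: any node of the cluster is reachable here (same address as the curl above).
ES_URL="${ES_URL:-http://localhost:9200}"

# Build the URL of the retry call quoted in the max_retry explanation.
reroute_retry_url() {
  printf '%s/_cluster/reroute?retry_failed=true&pretty' "$ES_URL"
}

# Ask the cluster to retry shards that exhausted their allocation attempts.
# The reroute API expects a POST request.
retry_failed_shards() {
  curl -s -XPOST "$(reroute_retry_url)"
}
```

Running retry_failed_shards and then repeating the allocation explain request should show whether the shard is still unassigned.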
I tried restarting Elasticsearch on each node, but I have been unable to get the cluster working again.
What process do I follow to get the cluster working again without losing any data?
It's a three-node cluster running elasticsearch-5.5.2.