Greetings, I was given a broken Graylog cluster to fix. On examination, the Elasticsearch cluster (7 nodes: 1 master, 6 data) was red because its filesystem was 100% full. After clearing out older data, I started seeing unassigned-shards errors. I roughly worked out what that meant by reading up on the web and tried a number of fixes, but none of them worked. Here are some of the things I tried:
- After filesystem space was freed up (1 TB total per node, now ~50% used), I restarted all the data nodes. The unassigned shard count increased initially but then subsided back to the original figure of ~1800.
- I ran curl -s -X POST 'http://dfwlnpgles-10:9200/_cluster/reroute?retry_failed=true&pretty' against the master. I get a lot of "state" : "STARTED" output with allocation IDs, and it does assign a few shards, maybe 10-15 per run, but that's it (see the loop below).
- I also made sure cluster-wide shard allocation is enabled:
C02RT2M5FVH6:~ mahars01$ curl -s "dfwlnpgles-10:9200/_cluster/settings?pretty"
{
  "persistent" : { },
  "transient" : {
    "cluster" : {
      "routing" : {
        "allocation" : {
          "enable" : "all"
        }
      }
    }
  }
}
But none of these seem to help.
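The only thing that moves the needle is the reroute retry, so I have been re-triggering it in a loop, roughly like this (just a sketch; the sleep interval is arbitrary, and 'unassign' is the _cat/health column you can see in the output further down):

# Sketch: keep retrying failed allocations and watch the unassigned count.
while true; do
  curl -s -X POST 'http://dfwlnpgles-10:9200/_cluster/reroute?retry_failed=true' > /dev/null
  curl -s 'http://dfwlnpgles-10:9200/_cat/health?h=status,unassign'
  sleep 30
done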
Here is some more info:
C02RT2M5FVH6:~ mahars01$ curl dfwlnpgles-10:9200/_cat/health?v
epoch timestamp cluster status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1600112243 14:37:23 elasticsearch2 red 7 7 1311 1303 0 2 1749 0 - 42.8%
C02RT2M5FVH6:~ mahars01$ curl dfwlnpgles-10:9200/_cat/nodes
10.5.4.78 17 99 53 5.17 6.27 6.16 mdi - dfwlnpgles-16
10.5.4.73 18 98 3 0.28 0.20 0.15 mdi - dfwlnpgles-11
10.5.4.75 25 99 58 6.54 7.17 7.10 mdi - dfwlnpgles-13
10.5.4.74 57 99 88 7.69 6.90 5.92 mdi - dfwlnpgles-12
10.5.4.77 30 99 33 1.82 1.74 1.77 mdi - dfwlnpgles-15
10.5.4.76 15 98 28 1.95 1.87 1.83 mdi - dfwlnpgles-14
10.5.4.72 44 41 7 0.17 0.10 0.13 mdi * dfwlnpgles-10
C02RT2M5FVH6:~ mahars01$ curl -s dfwlnpgles-10:9200/_cat/indices | wc -l
632
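For a breakdown of which shards are unassigned and why, I can pull it out of _cat/shards, something like this (a sketch; I believe the unassigned.reason column is available on this version):

# Sketch: count unassigned shards grouped by their unassigned reason.
curl -s 'dfwlnpgles-10:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' \
  | grep UNASSIGNED | awk '{print $5}' | sort | uniq -c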
I checked one of the unassigned shards:
C02RT2M5FVH6:~ mahars01$ curl 'http://dfwlnpgles-10:9200/_cluster/allocation/explain?pretty' -d '{
  "index": "weekly_1372",
  "shard": 3,
  "primary": true
}'
{
  "index" : "weekly_1372",
  "shard" : 3,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-09-03T18:22:53.137Z",   #NOTE: this date is from the time when fs was 100%
    "failed_allocation_attempts" : 5,
    "details" : "failed to create shard, failure IOException[No space left on device]",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "12SVd1opTqukG9D5q8CAwg",
      "node_name" : "dfwlnpgles-10",
      "transport_address" : "10.5.4.72:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "76V1negYQOK-cZOMLJAifw",
      "node_name" : "dfwlnpgles-13",
      "transport_address" : "10.5.4.75:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "7Lh9tonFTWeZ6aPF4tlF3g",
      "node_name" : "dfwlnpgles-16",
      "transport_address" : "10.5.4.78:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "MFcr1BX9T8yN2ihCiGsyPQ"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-09-03T18:22:53.137Z], failed_attempts[5], delayed=false, details[failed to create shard, failure IOException[No space left on device]], allocation_status[deciders_no]]]"
        }
      ]
    },
    {
      "node_id" : "Dcks1X5LSxy_RtrSw_GoTw",
      "node_name" : "dfwlnpgles-12",
      "transport_address" : "10.5.4.74:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "Qguk__nnSXmT8e9bny19VQ",
      "node_name" : "dfwlnpgles-11",
      "transport_address" : "10.5.4.73:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "XuGZf3x3RTqb71eQEbWWdQ",
      "node_name" : "dfwlnpgles-15",
      "transport_address" : "10.5.4.77:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : false,
        "allocation_id" : "jAKEjQS-QEy_jCuQS13s0g",
        "store_exception" : {
          "type" : "file_not_found_exception",
          "reason" : "no segments* file found in SimpleFSDirectory@/app/data/nodes/0/indices/blsAn4FlQmyzPtkp6y1xcw/3/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@7824675a: files: [write.lock]"
        }
      }
    },
    {
      "node_id" : "gfDZofDKQwmhswnGAiFKag",
      "node_name" : "dfwnlpgles-14",
      "transport_address" : "10.5.4.76:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}
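Reading that output, the only in-sync copy of this shard is on dfwlnpgles-16, and only the max_retry decider is saying no, so in theory retry_failed should cover it. If shards keep getting stuck, I'm considering forcing them back onto the node holding the in-sync copy with allocate_stale_primary, along these lines (a sketch only; I understand accept_data_loss is required for this command and it must point at the node with the in-sync copy):

curl -s -X POST -H 'Content-Type: application/json' 'http://dfwlnpgles-10:9200/_cluster/reroute?pretty' -d '{
  "commands" : [
    {
      "allocate_stale_primary" : {
        "index" : "weekly_1372",
        "shard" : 3,
        "node" : "dfwlnpgles-16",
        "accept_data_loss" : true
      }
    }
  ]
}'

Is that a sane approach here, or is there a better way to clear the backlog?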
Anything more I can provide here?