Environment
- Elasticsearch 7.8.0 running in EKS
- EFS filesystem
- 3 master nodes
- 5 data nodes
- 649 total shards
TL;DR
Shards are failing with ALLOCATION_FAILED due to an apparent disk quota issue ("Disk quota exceeded"), even though there should be more than enough disk space, and as far as I know I'm not exceeding the maximum number of shards per data node.
Full Story
I'm having an issue with unallocated replica shards: allocation fails with the error "Disk quota exceeded." I'm using EFS, which reports 8E (8 exabytes) of available disk space, and I currently have a total of 649 shards spread across 5 data nodes.
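For reference, disk usage and shard count per data node can be verified with the cat allocation API (its output includes shards, disk.used, disk.avail, and disk.percent for each node):

GET /_cat/allocation?v

Nothing in that output suggested the nodes were anywhere near Elasticsearch's disk watermarks.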
I executed this query:
GET /_cluster/allocation/explain
{
  "index": ".monitoring-kibana-7-2020.07.21",
  "shard": 0,
  "primary": false
}
...and got the following results:
{
  "index" : ".monitoring-kibana-7-2020.07.21",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-07-21T13:26:31.961Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "Ku69PlSpQ8miS0CTrGYMXQ",
      "node_name" : "elk-es-data-4",
      "transport_address" : "10.229.27.241:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "KwJM0xwQQMC2rrs4zxfqvw",
      "node_name" : "elk-es-data-0",
      "transport_address" : "10.229.25.185:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "co-W7AfbSwWmABjAtw1LnQ",
      "node_name" : "elk-es-data-2",
      "transport_address" : "10.229.11.165:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "hKKp7AOiRAKAI5EnB-83sQ",
      "node_name" : "elk-es-data-1",
      "transport_address" : "10.229.4.130:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "heV_9oFRRQy6_VWNiFhaBg",
      "node_name" : "elk-es-data-3",
      "transport_address" : "10.229.16.67:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[.monitoring-kibana-7-2020.07.21][0], node[heV_9oFRRQy6_VWNiFhaBg], [P], s[STARTED], a[id=l7wnPB-ORFOAyw62WvMIoA]]"
        }
      ]
    }
  ]
}
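Per the max_retry decider's message, once the underlying "Disk quota exceeded" condition is resolved, the failed allocations can be retried manually:

POST /_cluster/reroute?retry_failed=true

This resets the failed-allocation counter and asks the cluster to attempt assigning the unassigned shards again, but it won't help until the root cause of the disk quota error is found.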