Elasticsearch Shard Allocation - ALLOCATION_FAILED due to apparent disk quota issues, nowhere near max

Environment

  • Elasticsearch 7.8.0 running in EKS
  • EFS filesystem
  • 3 master nodes
  • 5 data nodes
  • 649 total shards

TL;DR

Shards are going ALLOCATION_FAILED due to apparent disk quota issues, even though there should be more than enough disk space and, as far as I know, I'm not exceeding the maximum number of shards per data node.

Full Story

I'm having an issue with unassigned replica shards: recovery is failing with a "Disk quota exceeded" error. I'm using EFS, which shows 8E of disk space available, and I currently have a total of 649 shards spread across 5 data nodes.
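
For reference, a quick way to cross-check what Elasticsearch itself reports for per-node disk usage and shard counts (rather than relying on df against the EFS mount), and to confirm the cluster-wide shards-per-node limit, is something along these lines (a minimal sketch; the filter_path just trims the settings output):

GET _cat/allocation?v

GET _cluster/settings?include_defaults=true&filter_path=*.cluster.max_shards_per_node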

I executed this query:

GET /_cluster/allocation/explain
{
  "index": ".monitoring-kibana-7-2020.07.21",
  "shard": 0,
  "primary": false
}

...and got the following results:

{
  "index" : ".monitoring-kibana-7-2020.07.21",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-07-21T13:26:31.961Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "Ku69PlSpQ8miS0CTrGYMXQ",
      "node_name" : "elk-es-data-4",
      "transport_address" : "10.229.27.241:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "KwJM0xwQQMC2rrs4zxfqvw",
      "node_name" : "elk-es-data-0",
      "transport_address" : "10.229.25.185:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "co-W7AfbSwWmABjAtw1LnQ",
      "node_name" : "elk-es-data-2",
      "transport_address" : "10.229.11.165:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "hKKp7AOiRAKAI5EnB-83sQ",
      "node_name" : "elk-es-data-1",
      "transport_address" : "10.229.4.130:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "heV_9oFRRQy6_VWNiFhaBg",
      "node_name" : "elk-es-data-3",
      "transport_address" : "10.229.16.67:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[.monitoring-kibana-7-2020.07.21][0], node[heV_9oFRRQy6_VWNiFhaBg], [P], s[STARTED], a[id=l7wnPB-ORFOAyw62WvMIoA]]"
        }
      ]
    }
  ]
}

"Disk quota exceeded" indicates that a write (or similar) syscall returned EDQUOT. In other words, this error is coming from the operating system, and Elasticsearch is simply reporting it verbatim. Note that this is different from ENOSPC: you can have ample free space on a disk and still exceed some quota or other. If this isn't a quota you've configured, then you'll need to speak with AWS support to identify which limit you're hitting.
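
Once the underlying limit has been dealt with, the shards won't recover on their own, because the max_retry decider has already tripped (as shown in the explain output above). A minimal sketch of the follow-up, using the retry endpoint that the decider output itself suggests, plus a check that the shards are assigned again:

POST /_cluster/reroute?retry_failed=true

GET _cat/shards?v&h=index,shard,prirep,state,unassigned.reason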

Note that the reference manual recommends against using EFS for your data nodes.
