Environment
- Elasticsearch 7.8.0 running in EKS
- EFS filesystem
- 3 master nodes
- 5 data nodes
- 649 total shards
TL;DR
Shards are failing with ALLOCATION_FAILED due to an apparent disk quota issue ("Disk quota exceeded"), even though there should be more than enough disk space, and as far as I know I'm not exceeding the maximum number of shards per data node.
Full Story
I'm having an issue with unallocated replica shards: allocation fails with the error "Disk quota exceeded." I'm using EFS, which reports 8E (8 exabytes) of available disk space, and I currently have a total of 649 shards spread across 5 data nodes.
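For reference, disk usage and shard count per data node can be verified with the cat allocation API (its output includes shards, disk.used, disk.avail, and disk.percent for each node):

GET /_cat/allocation?v

Nothing in that output suggested the nodes were anywhere near Elasticsearch's disk watermarks.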
I executed this query:
GET /_cluster/allocation/explain
{
  "index": ".monitoring-kibana-7-2020.07.21",
  "shard": 0,
  "primary": false
}
...and got the following results:
{
  "index" : ".monitoring-kibana-7-2020.07.21",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-07-21T13:26:31.961Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "Ku69PlSpQ8miS0CTrGYMXQ",
      "node_name" : "elk-es-data-4",
      "transport_address" : "10.229.27.241:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "KwJM0xwQQMC2rrs4zxfqvw",
      "node_name" : "elk-es-data-0",
      "transport_address" : "10.229.25.185:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "co-W7AfbSwWmABjAtw1LnQ",
      "node_name" : "elk-es-data-2",
      "transport_address" : "10.229.11.165:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "hKKp7AOiRAKAI5EnB-83sQ",
      "node_name" : "elk-es-data-1",
      "transport_address" : "10.229.4.130:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "heV_9oFRRQy6_VWNiFhaBg",
      "node_name" : "elk-es-data-3",
      "transport_address" : "10.229.16.67:9300",
      "node_attributes" : {
        "ilm_phase" : "hot",
        "ml.machine_memory" : "8589934592",
        "ml.max_open_jobs" : "20",
        "xpack.installed" : "true",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-07-21T13:26:31.961Z], failed_attempts[5], failed_nodes[[KwJM0xwQQMC2rrs4zxfqvw, Ku69PlSpQ8miS0CTrGYMXQ]], delayed=false, details[failed shard on node [Ku69PlSpQ8miS0CTrGYMXQ]: failed recovery, failure RecoveryFailedException[[.monitoring-kibana-7-2020.07.21][0]: Recovery failed from {elk-es-data-3}{heV_9oFRRQy6_VWNiFhaBg}{AThiFSVtSWye8ppytYCLiA}{10.229.16.67}{10.229.16.67:9300}{dilrt}{ilm_phase=hot, ml.machine_memory=8589934592, ml.max_open_jobs=20, xpack.installed=true, transform.node=true} into {elk-es-data-4}{Ku69PlSpQ8miS0CTrGYMXQ}{BlN9xVdwRjaNphDHFW0PwQ}{10.229.27.241}{10.229.27.241:9300}{dilrt}{ml.machine_memory=8589934592, xpack.installed=true, transform.node=true, ilm_phase=hot, ml.max_open_jobs=20}]; nested: RemoteTransportException[[elk-es-data-3][10.229.16.67:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[elk-es-data-4][10.229.27.241:9300][internal:index/shard/recovery/clean_files]]; nested: IOException[Disk quota exceeded]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[.monitoring-kibana-7-2020.07.21][0], node[heV_9oFRRQy6_VWNiFhaBg], [P], s[STARTED], a[id=l7wnPB-ORFOAyw62WvMIoA]]"
        }
      ]
    }
  ]
}
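Per the max_retry decider's message, once the underlying "Disk quota exceeded" condition is resolved, the failed allocations can be retried manually:

POST /_cluster/reroute?retry_failed=true

This resets the failed-allocation counter and asks the cluster to attempt assigning the unassigned shards again, but it won't help until the root cause of the disk quota error is found.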