Allocation Failed

Hi,

I've got many error messages like this from the cluster allocation explain API.
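A request along these lines produces the output below (the index/shard/primary body fields name the shard in question):

GET _cluster/allocation/explain
{
  "index": "logstash-prod_operations_clear-001098",
  "shard": 0,
  "primary": false
}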

{
  "index" : "logstash-prod_operations_clear-001098",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2023-03-07T09:23:48.390Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [8zemZfddQOm3iNFio-GgsA]: failed recovery, failure RecoveryFailedException[[logstash-prod_operations_clear-001098][0]: Recovery failed from {elastic-cold-com-15-rz1}{UTpMrVw1SW6PAJlgjzEbAg}{ng25sO6yTmyxVSoJighbZQ}{10.1.6.87}{10.1.6.87:9300}{cdfhrstw}{rz=rz1, xpack.installed=true, storage=hdd, transform.node=true} into {elastic-cold-com-16-rz2}{8zemZfddQOm3iNFio-GgsA}{7RT9r9OeQbKZ0qz7E7eECw}{10.2.6.87}{10.2.6.87:9300}{cdfhrstw}{xpack.installed=true, transform.node=true, rz=rz2, storage=hdd}]; nested: RemoteTransportException[[elastic-cold-com-15-rz1][10.1.6.87:9300][internal:index/shard/recovery/start_recovery]]; nested: CircuitBreakingException[[parent] Data too large, data for [internal:index/shard/recovery/start_recovery] would be [6363867618/5.9gb], which is larger than the limit of [6120328396/5.6gb], real usage: [6363866232/5.9gb], new bytes reserved: [1386/1.3kb], usages [request=8736/8.5kb, fielddata=144290/140.9kb, in_flight_requests=1386/1.3kb, model_inference=0/0b, accounting=578797672/551.9mb]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "awaiting_info",
  "allocate_explanation" : "cannot allocate because information about existing shard data is still being retrieved from some of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "-LZu73W2TgOa87dU87Nx0A",
      "node_name" : "elastic-cold-com-22-rz2",
      "transport_address" : "10.2.6.91:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "yes"
    },
    {
      "node_id" : "3jgTzoQdQJSpQaLaQhlMPg",
      "node_name" : "elastic-cold-com-4-rz2",
      "transport_address" : "10.2.6.78:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "yes"
    },
    {
      "node_id" : "6fUpig1RQc-161P30ZG1CA",
      "node_name" : "elastic-cold-com-6-rz2",
      "transport_address" : "10.2.6.79:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "yes"
    },
    {
      "node_id" : "8zemZfddQOm3iNFio-GgsA",
      "node_name" : "elastic-cold-com-16-rz2",
      "transport_address" : "10.2.6.87:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "yes"
    },
    {
      "node_id" : "CE_ICelORTCTTNv8GzcICg",
      "node_name" : "elastic-cold-com-8-rz2",
      "transport_address" : "10.2.6.82:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "yes"
    },
    {
      "node_id" : "P3qZHvSQTReIUUtUw-J6iA",
      "node_name" : "elastic-cold-com-2-rz2",
      "transport_address" : "10.2.6.77:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "yes"
    },
    {
      "node_id" : "WTFqNa4zTh-dHsp7aryo2w",
      "node_name" : "elastic-cold-com-18-rz2",
      "transport_address" : "10.2.6.88:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "yes"
    },
    {
      "node_id" : "Wv3DS25qTimNewVFmS-L_A",
      "node_name" : "elastic-cold-com-10-rz2",
      "transport_address" : "10.2.6.84:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "yes"
    },
    {
      "node_id" : "diEioxR1RRSHP8GEg9Vn8g",
      "node_name" : "elastic-cold-com-12-rz2",
      "transport_address" : "10.2.6.85:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "yes"
    },
    {
      "node_id" : "1d0fBM28Q2W9J3lAXg6BEA",
      "node_name" : "elastic-cold-com-20-rz2",
      "transport_address" : "10.2.6.89:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "throttled",
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
    {
      "node_id" : "KLV3tbJvRmGC8GdgpXV0vQ",
      "node_name" : "elastic-cold-com-14-rz2",
      "transport_address" : "10.2.6.86:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "throttled",
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        }
      ]
    },
    {
      "node_id" : "4aylJLz8SJ2P64ciJVtPtg",
      "node_name" : "elastic-cold-com-9-rz1",
      "transport_address" : "10.1.6.84:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "7nMPX7b-T8KDLWIkqQ8afg",
      "node_name" : "elastic-cold-com-13-rz1",
      "transport_address" : "10.1.6.86:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "9pz0HbzKQhi05UW9Zpb-ng",
      "node_name" : "elastic-hot-com-6-rz2",
      "transport_address" : "10.2.6.74:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "ssd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : """node does not match index setting [index.routing.allocation.require] filters [storage:"hdd"]"""
        }
      ]
    },
    {
      "node_id" : "H-HREaclQce2POdNnx7-MQ",
      "node_name" : "elastic-cold-com-11-rz1",
      "transport_address" : "10.1.6.85:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "JivunSuJRfucH1OkjOsxWw",
      "node_name" : "elastic-cold-com-3-rz1",
      "transport_address" : "10.1.6.78:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "NH8YRPyZS-OzJVs-Qn8wCg",
      "node_name" : "elastic-cold-com-5-rz1",
      "transport_address" : "10.1.6.79:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "TggXs0TRQx2MRqYjve7YHw",
      "node_name" : "elastic-cold-com-21-rz1",
      "transport_address" : "10.1.6.90:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=90%], using more disk space than the maximum allowed [90.0%], actual free: [7.483555394733263%]"
        },
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "TndIMiQSTS-hJkapNLJHSw",
      "node_name" : "elastic-hot-com-4-rz2",
      "transport_address" : "10.2.6.76:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "ssd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : """node does not match index setting [index.routing.allocation.require] filters [storage:"hdd"]"""
        }
      ]
    },
    {
      "node_id" : "Tp-57eKqTByni5z3Sy8_Aw",
      "node_name" : "elastic-cold-com-17-rz1",
      "transport_address" : "10.1.6.88:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "UTpMrVw1SW6PAJlgjzEbAg",
      "node_name" : "elastic-cold-com-15-rz1",
      "transport_address" : "10.1.6.87:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node [[logstash-prod_operations_clear-001098][0], node[UTpMrVw1SW6PAJlgjzEbAg], [P], s[STARTED], a[id=ZjhFWijkTuuhoRiBUDsDAA]]"
        },
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        },
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "bN-qQ0iXTEm7zSMXX3jVYg",
      "node_name" : "elastic-cold-com-7-rz1",
      "transport_address" : "10.1.6.82:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "dpuOVN5qR2qlGHTT3RXylQ",
      "node_name" : "elastic-cold-com-23-rz1",
      "transport_address" : "10.1.6.91:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "throttling",
          "decision" : "THROTTLE",
          "explanation" : "reached the limit of incoming shard recoveries [2], cluster setting [cluster.routing.allocation.node_concurrent_incoming_recoveries=2] (can also be set via [cluster.routing.allocation.node_concurrent_recoveries])"
        },
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "fkkU0UULQtO-ETf2QcN8ww",
      "node_name" : "elastic-cold-com-19-rz1",
      "transport_address" : "10.1.6.89:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "hdd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "luOxP-ExTi26KigJb8J0ng",
      "node_name" : "elastic-hot-com-3-rz1",
      "transport_address" : "10.1.6.76:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "ssd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : """node does not match index setting [index.routing.allocation.require] filters [storage:"hdd"]"""
        },
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "mPXUU9zNQrKic0sPxz7pVQ",
      "node_name" : "elastic-hot-com-2-rz2",
      "transport_address" : "10.2.6.75:9300",
      "node_attributes" : {
        "rz" : "rz2",
        "xpack.installed" : "true",
        "storage" : "ssd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : """node does not match index setting [index.routing.allocation.require] filters [storage:"hdd"]"""
        }
      ]
    },
    {
      "node_id" : "oDPOAMY2RQWmdpqDqONOzA",
      "node_name" : "elastic-hot-com-5-rz1",
      "transport_address" : "10.1.6.77:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "ssd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : """node does not match index setting [index.routing.allocation.require] filters [storage:"hdd"]"""
        },
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    },
    {
      "node_id" : "w3GlzXS1T66aVrZJPftZ9A",
      "node_name" : "elastic-hot-com-1-rz1",
      "transport_address" : "10.1.6.75:9300",
      "node_attributes" : {
        "rz" : "rz1",
        "xpack.installed" : "true",
        "storage" : "ssd",
        "transform.node" : "true"
      },
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "filter",
          "decision" : "NO",
          "explanation" : """node does not match index setting [index.routing.allocation.require] filters [storage:"hdd"]"""
        },
        {
          "decider" : "awareness",
          "decision" : "NO",
          "explanation" : "there are too many copies of the shard allocated to nodes with attribute [rz], there are [2] total configured shard copies for this shard id and [2] total attribute values, expected the allocated shard count per attribute [2] to be less than or equal to the upper bound of the required number of shards per attribute [1]"
        }
      ]
    }
  ]
}

Can someone give me advice on what I can do?

You're using a relatively old version of ES, and in newer versions the message now reads as follows:

Elasticsearch is retrieving information about this shard from one or more
nodes. It will make an allocation decision after it receives this
information. Please wait.

As it says, you just have to wait.
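That said, note the failed_allocation_attempts of 5 in your output: that is the default retry limit (index.allocation.max_retries), so once the underlying circuit breaker problem is fixed the cluster will not retry these shards on its own. A minimal nudge, assuming the memory pressure has been dealt with first, is:

POST _cluster/reroute?retry_failed=true

That asks the allocator to have another go at every shard that has exhausted its retries.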

Hey David,

thanks for the quick response.
But the problem is that when I run a query in Kibana Discover, it has been showing "shards failed" for over 4 weeks, and I very often get wrong or incomplete log results.

So I do wait, often for a long time, but ES still shows wrong results.

What is the full output of the cluster stats API?
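That is:

GET _cluster/stats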

Hi Christian,

{
  "_nodes" : {
    "total" : 31,
    "successful" : 25,
    "failed" : 6,
    "failures" : [
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [fkkU0UULQtO-ETf2QcN8ww]",
        "node_id" : "fkkU0UULQtO-ETf2QcN8ww",
        "caused_by" : {
          "type" : "circuit_breaking_exception",
          "reason" : "[parent] Data too large, data for [cluster:monitor/stats[n]] would be [6345608124/5.9gb], which is larger than the limit of [6120328396/5.6gb], real usage: [6345590496/5.9gb], new bytes reserved: [17628/17.2kb], usages [request=10176/9.9kb, fielddata=127728/124.7kb, in_flight_requests=17628/17.2kb, model_inference=0/0b, accounting=577140864/550.4mb]",
          "bytes_wanted" : 6345608124,
          "bytes_limit" : 6120328396,
          "durability" : "PERMANENT"
        }
      },
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [dpuOVN5qR2qlGHTT3RXylQ]",
        "node_id" : "dpuOVN5qR2qlGHTT3RXylQ",
        "caused_by" : {
          "type" : "circuit_breaking_exception",
          "reason" : "[parent] Data too large, data for [cluster:monitor/stats[n]] would be [6356287508/5.9gb], which is larger than the limit of [6120328396/5.6gb], real usage: [6356269880/5.9gb], new bytes reserved: [17628/17.2kb], usages [request=8960/8.7kb, fielddata=127134/124.1kb, in_flight_requests=17628/17.2kb, model_inference=0/0b, accounting=609030408/580.8mb]",
          "bytes_wanted" : 6356287508,
          "bytes_limit" : 6120328396,
          "durability" : "PERMANENT"
        }
      },
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [TggXs0TRQx2MRqYjve7YHw]",
        "node_id" : "TggXs0TRQx2MRqYjve7YHw",
        "caused_by" : {
          "type" : "circuit_breaking_exception",
          "reason" : "[parent] Data too large, data for [cluster:monitor/stats[n]] would be [6364564980/5.9gb], which is larger than the limit of [6120328396/5.6gb], real usage: [6364547352/5.9gb], new bytes reserved: [17628/17.2kb], usages [request=2832/2.7kb, fielddata=127794/124.7kb, in_flight_requests=19000/18.5kb, model_inference=0/0b, accounting=602872088/574.9mb]",
          "bytes_wanted" : 6364564980,
          "bytes_limit" : 6120328396,
          "durability" : "PERMANENT"
        }
      },
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [UTpMrVw1SW6PAJlgjzEbAg]",
        "node_id" : "UTpMrVw1SW6PAJlgjzEbAg",
        "caused_by" : {
          "type" : "circuit_breaking_exception",
          "reason" : "[parent] Data too large, data for [cluster:monitor/stats[n]] would be [6333789116/5.8gb], which is larger than the limit of [6120328396/5.6gb], real usage: [6333771488/5.8gb], new bytes reserved: [17628/17.2kb], usages [request=14824/14.4kb, fielddata=111174/108.5kb, in_flight_requests=17628/17.2kb, model_inference=0/0b, accounting=577661880/550.9mb]",
          "bytes_wanted" : 6333789116,
          "bytes_limit" : 6120328396,
          "durability" : "PERMANENT"
        }
      },
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [diEioxR1RRSHP8GEg9Vn8g]",
        "node_id" : "diEioxR1RRSHP8GEg9Vn8g",
        "caused_by" : {
          "type" : "circuit_breaking_exception",
          "reason" : "[parent] Data too large, data for [cluster:monitor/stats[n]] would be [6267052180/5.8gb], which is larger than the limit of [6120328396/5.6gb], real usage: [6267034552/5.8gb], new bytes reserved: [17628/17.2kb], usages [request=28384/27.7kb, fielddata=44611/43.5kb, in_flight_requests=19000/18.5kb, model_inference=0/0b, accounting=585680200/558.5mb]",
          "bytes_wanted" : 6267052180,
          "bytes_limit" : 6120328396,
          "durability" : "PERMANENT"
        }
      },
      {
        "type" : "failed_node_exception",
        "reason" : "Failed node [7nMPX7b-T8KDLWIkqQ8afg]",
        "node_id" : "7nMPX7b-T8KDLWIkqQ8afg",
        "caused_by" : {
          "type" : "circuit_breaking_exception",
          "reason" : "[parent] Data too large, data for [cluster:monitor/stats[n]] would be [6296061804/5.8gb], which is larger than the limit of [6120328396/5.6gb], real usage: [6296044176/5.8gb], new bytes reserved: [17628/17.2kb], usages [request=8656/8.4kb, fielddata=91200/89kb, in_flight_requests=17628/17.2kb, model_inference=0/0b, accounting=579387800/552.5mb]",
          "bytes_wanted" : 6296061804,
          "bytes_limit" : 6120328396,
          "durability" : "PERMANENT"
        }
      }
    ]
  },
  "cluster_name" : "elastic-com-c1",
  "cluster_uuid" : "pk3BirT2SB-Z5MAPU5vC8A",
  "timestamp" : 1679331894949,
  "status" : "yellow",
  "indices" : {
    "count" : 2068,
    "shards" : {
      "total" : 16004,
      "primaries" : 7878,
      "replication" : 1.0314800710840315,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 16,
          "avg" : 7.738878143133462
        },
        "primaries" : {
          "min" : 0,
          "max" : 8,
          "avg" : 3.8094777562862667
        },
        "replication" : {
          "min" : 0.0,
          "max" : 5.0,
          "avg" : 0.8762532237266313
        }
      }
    },
    "docs" : {
      "count" : 14729249991,
      "deleted" : 399380
    },
    "store" : {
      "size_in_bytes" : 24956558350747,
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size_in_bytes" : 2013896,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size_in_bytes" : 188214602,
      "total_count" : 322519208,
      "hit_count" : 6450503,
      "miss_count" : 316068705,
      "cache_size" : 220988,
      "cache_count" : 486371,
      "evictions" : 265383
    },
    "completion" : {
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 280470,
      "memory_in_bytes" : 11468541452,
      "terms_memory_in_bytes" : 11203792224,
      "stored_fields_memory_in_bytes" : 150720048,
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory_in_bytes" : 71232,
      "points_memory_in_bytes" : 0,
      "doc_values_memory_in_bytes" : 113957948,
      "index_writer_memory_in_bytes" : 1069367056,
      "version_map_memory_in_bytes" : 2465,
      "fixed_bit_set_memory_in_bytes" : 45496,
      "max_unsafe_auto_id_timestamp" : 1679331682680,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "boolean",
          "count" : 934,
          "index_count" : 934
        },
        {
          "name" : "date",
          "count" : 2126,
          "index_count" : 2088
        },
        {
          "name" : "double",
          "count" : 1,
          "index_count" : 1
        },
        {
          "name" : "geo_point",
          "count" : 2074,
          "index_count" : 2074
        },
        {
          "name" : "half_float",
          "count" : 4146,
          "index_count" : 2073
        },
        {
          "name" : "ip",
          "count" : 2075,
          "index_count" : 2074
        },
        {
          "name" : "keyword",
          "count" : 140489,
          "index_count" : 2088
        },
        {
          "name" : "long",
          "count" : 1675,
          "index_count" : 1673
        },
        {
          "name" : "nested",
          "count" : 12,
          "index_count" : 12
        },
        {
          "name" : "object",
          "count" : 27974,
          "index_count" : 2088
        },
        {
          "name" : "text",
          "count" : 139937,
          "index_count" : 1673
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "7.6.2",
        "index_count" : 5,
        "primary_shard_count" : 5,
        "total_primary_bytes" : 511534
      },
      {
        "version" : "7.7.1",
        "index_count" : 5,
        "primary_shard_count" : 5,
        "total_primary_bytes" : 2344624
      },
      {
        "version" : "7.8.0",
        "index_count" : 29,
        "primary_shard_count" : 29,
        "total_primary_bytes" : 107128695
      },
      {
        "version" : "7.12.0",
        "index_count" : 2121,
        "primary_shard_count" : 10768,
        "total_primary_bytes" : 12106961132047
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 25,
      "coordinating_only" : 0,
      "data" : 22,
      "data_cold" : 22,
      "data_content" : 22,
      "data_frozen" : 22,
      "data_hot" : 22,
      "data_warm" : 22,
      "ingest" : 6,
      "master" : 3,
      "ml" : 0,
      "remote_cluster_client" : 25,
      "transform" : 22,
      "voting_only" : 0
    },
    "versions" : [
      "7.12.0"
    ],
    "os" : {
      "available_processors" : 204,
      "allocated_processors" : 204,
      "names" : [
        {
          "name" : "Linux",
          "count" : 25
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Debian GNU/Linux 10 (buster)",
          "count" : 25
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 25
        }
      ],
      "mem" : {
        "total_in_bytes" : 486110662656,
        "free_in_bytes" : 27505487872,
        "used_in_bytes" : 458605174784,
        "free_percent" : 6,
        "used_percent" : 94
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 156
      },
      "open_file_descriptors" : {
        "min" : 1037,
        "max" : 12189,
        "avg" : 8705
      }
    },
    "jvm" : {
      "max_uptime_in_millis" : 30797566735,
      "versions" : [
        {
          "version" : "15.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "15.0.1+9",
          "vm_vendor" : "AdoptOpenJDK",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 25
        }
      ],
      "mem" : {
        "heap_used_in_bytes" : 172233827976,
        "heap_max_in_bytes" : 253403070464
      },
      "threads" : 3952
    },
    "fs" : {
      "total_in_bytes" : 38483938881536,
      "free_in_bytes" : 12847313604608,
      "available_in_bytes" : 11278792450048
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 25
      },
      "http_types" : {
        "security4" : 25
      }
    },
    "discovery_types" : {
      "zen" : 25
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "deb",
        "count" : 25
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 2,
      "processor_stats" : {
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time_in_millis" : 0
        }
      }
    }
  }
}

It looks like a number of nodes have an issue with heap space, as data collection from them failed due to circuit breaker errors. You appear to have around 10GB of heap assigned per node on average (heap_max_in_bytes of 253403070464 across 25 nodes is roughly 10.1GB each, although the ~5.6GB parent breaker limit reported by the failing cold nodes suggests those particular nodes run a 6GB heap) and a very large number of reasonably small shards (roughly 25TB of store across 16004 shards, so around 1.5GB average size).
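You can cross-check the per-node heap sizes with the cat nodes API, for example:

GET _cat/nodes?v&h=name,node.role,heap.max,heap.percent,ram.percent&s=name

If the cold nodes really are running ~6GB heaps, raising -Xms/-Xmx in their jvm.options would give the parent circuit breaker (which by default trips at 95% of heap) more headroom.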

If nodes suffer from long GC pauses (is there anything in the logs?), shards may be relocated away from them as a result. Given the large number of shards this can take a while, depending on the load the cluster is under and the amount of resources available, especially disk I/O.
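You can watch the progress of ongoing recoveries with, for example:

GET _cat/recovery?v&active_only=true&h=index,shard,stage,source_node,target_node,bytes_percent,time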

Do you have any entries in the Elasticsearch logs around nodes leaving and rejoining the cluster, e.g. due to long GC or other issues?

Also ...

... this version is just coming up on 2 years old and is well past EOL; you should upgrade to a supported version as a matter of urgency.

