Elasticsearch shard allocation, uneven distribution of shards among nodes

Hi,
May I ask for your help? The problem with uneven allocation of shards on the Elastic cluster has come back.
In this state the full nodes run at 90% CPU and the whole cluster starts to refuse requests.

This problem has happened a few times before. The Elastic cluster works correctly for a couple of months and then, for unknown reasons, it gets into a state where some nodes hold fewer shards than the others but use up almost their entire disk capacity.

This is caused by some nodes being allocated larger shards, while the cluster allocates shards to the nodes with the fewest shards rather than by disk utilisation.

Currently this has happened on 3 nodes.

As a workaround, I exclude the affected nodes from the cluster and then bring them back, but this only works as long as I have enough spare capacity on the cluster.
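
For reference, a minimal sketch of what such an exclude/re-include cycle could look like. It assumes the exclusion is done with the cluster-level allocation filter `cluster.routing.allocation.exclude._name` sent to `PUT _cluster/settings`; the post does not say which mechanism was actually used, and the helper name `exclude_body` is hypothetical. The script only builds and prints the request bodies, it does not contact a cluster.

```python
import json

# Assumption: nodes are drained with the allocation filter setting below.
SETTING = "cluster.routing.allocation.exclude._name"

def exclude_body(node_names, scope="persistent"):
    """Build a body for PUT _cluster/settings (hypothetical helper).

    A non-empty list excludes those nodes from shard allocation;
    an empty list sets the value to null, which clears the filter
    so the nodes can receive shards again.
    """
    value = ",".join(node_names) if node_names else None
    return {scope: {SETTING: value}}

# Step 1: move shards off the full node
print("PUT _cluster/settings")
print(json.dumps(exclude_body(["tela11prahkz"]), indent=2))

# Step 2: once the node is empty, allow allocation on it again
print("PUT _cluster/settings")
print(json.dumps(exclude_body([]), indent=2))
```

Draining a node this way needs enough free disk on the remaining nodes to absorb its shards, which is the capacity limit mentioned above.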

I have Elasticsearch version 7.17.0.
I don't have any specific attributes or awareness configured.

Do you have any advice on how to resolve this issue?
Should I upgrade to v8.x?

Thank you


GET _cat/allocation?v&h=node,shards,disk.indices,disk.used, disk.avail,disk.total,disk.percent&s=node

node         shards disk.indices disk.used disk.total disk.percent
tela01prahkz    197        1.1tb     1.2tb      1.9tb           62
tela02prahkz    198        1.3tb     1.4tb      1.9tb           75
tela03prahkz    197      711.9gb   803.9gb      1.9tb           40
tela04prahkz    197        1.2tb     1.3tb      1.9tb           68
tela05prahkz    197        1.1tb     1.2tb      1.9tb           64
tela06prahkz    197        885gb     976gb      1.9tb           49
tela07prahkz    196      647.9gb   738.8gb      1.9tb           37
tela08prahkz    197      891.5gb   983.2gb      1.9tb           49
tela09prahkz    197          1tb     1.1tb      1.9tb           58
tela10prahkz    197        1.2tb     1.2tb      1.9tb           66
tela11prahkz    151        1.7tb     1.7tb      1.9tb           89 <====low shards but disk full
tela12prahkz    197        1.1tb     1.1tb      1.9tb           60
tela13prahkz    197      795.4gb   799.3gb      1.9tb           40
tela14prahkz    198      734.6gb     739gb      1.9tb           37
tela15prahkz    197        1.4tb     1.4tb      1.9tb           76
tela16prahkz    198        1.4tb     1.4tb      1.9tb           74
tela17prahkz    198        1.3tb     1.3tb      1.9tb           68
tela18prahkz    196      723.2gb   727.6gb      1.9tb           36
tela19prahkz    154        1.6tb     1.6tb      1.9tb           86 <====low shards but disk full
tela20prahkz    197        1.5tb     1.5tb      1.9tb           78
tela21prahkz    197      933.3gb   938.5gb      1.9tb           47
tela22prahkz    197      993.9gb   999.2gb      1.9tb           50
tela23prahkz    196      952.1gb   957.1gb      1.9tb           48
tela24prahkz    197     1020.9gb       1tb      1.9tb           51
tela25prahkz    197      826.2gb     833gb      1.9tb           42
tela26prahkz    197          1tb       1tb      1.9tb           56
tela27prahkz    197        1.4tb     1.4tb      1.9tb           75
tela28prahkz    198          1tb       1tb      1.9tb           55
tela29prahkz    197        1.2tb     1.2tb      1.9tb           66
tela30prahkz    197      930.2gb   934.5gb      1.9tb           46
tela31prahkz    197      924.5gb   930.9gb      1.9tb           45
tela32prahkz    198      793.7gb   798.3gb      1.9tb           39
tela33prahkz    196        546gb     551gb      1.9tb           27
tela34prahkz    197      806.2gb     811gb      1.9tb           39
tela35prahkz    197      792.4gb   797.6gb      1.9tb           39
tela36prahkz    197        1.2tb     1.2tb      1.9tb           64
tela37prahkz    198        982gb   988.1gb      1.9tb           48
tela38prahkz    155        1.7tb     1.7tb      1.9tb           89 <====low shards but disk full
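
To make the size difference concrete, here is a rough calculation of the average shard size per node, using a few rows copied from the output above. It is only a sketch: the `disk.indices` values are rounded by `_cat`, and the helper names `to_gb` and `avg_shard_gb` are made up for this example.

```python
# Conversion factors for the size suffixes that appear in the table
UNITS_GB = {"gb": 1.0, "tb": 1024.0}

def to_gb(size: str) -> float:
    """Convert a _cat size string such as '1.7tb' or '546gb' to gigabytes."""
    number, unit = size[:-2], size[-2:]
    return float(number) * UNITS_GB[unit]

def avg_shard_gb(shards: int, size: str) -> float:
    """Average shard size on a node, in gigabytes."""
    return to_gb(size) / shards

# (node, shards, disk.indices) copied from the allocation output above
rows = [
    ("tela11prahkz", 151, "1.7tb"),    # marked as full
    ("tela19prahkz", 154, "1.6tb"),    # marked as full
    ("tela38prahkz", 155, "1.7tb"),    # marked as full
    ("tela33prahkz", 196, "546gb"),    # emptiest node
    ("tela07prahkz", 196, "647.9gb"),
]

for node, shards, size in rows:
    print(f"{node}: {shards} shards, ~{avg_shard_gb(shards, size):.1f} GB per shard")
```

With these rows the three full nodes come out at roughly 10 to 12 GB per shard, against about 3 GB per shard on the emptiest nodes, so equal shard counts would still leave very different disk usage.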

I already reported this issue in an earlier topic: "Elasticseach shards allocation".

I would appreciate any advice, thank you.

And here are the cluster stats:

GET _cluster/stats?pretty&human
{
  "_nodes" : {
    "total" : 49,
    "successful" : 49,
    "failed" : 0
  },
  "cluster_name" : "o2-cz-cem",
  "cluster_uuid" : "OAIGGQ4QTqiz4i_tgdFx3g",
  "timestamp" : 1671527409789,
  "status" : "green",
  "indices" : {
    "count" : 2750,
    "shards" : {
      "total" : 7579,
      "primaries" : 4291,
      "replication" : 0.7662549522255885,
      "index" : {
        "shards" : {
          "min" : 1,
          "max" : 10,
          "avg" : 2.756
        },
        "primaries" : {
          "min" : 1,
          "max" : 10,
          "avg" : 1.5603636363636364
        },
        "replication" : {
          "min" : 0.0,
          "max" : 1.0,
          "avg" : 0.9225454545454546
        }
      }
    },
    "docs" : {
      "count" : 100377328903,
      "deleted" : 49381640
    },
    "store" : {
      "size" : "43.2tb",
      "size_in_bytes" : 47585991377166,
      "total_data_set_size" : "43.2tb",
      "total_data_set_size_in_bytes" : 47585991377166,
      "reserved" : "0b",
      "reserved_in_bytes" : 0
    },
    "fielddata" : {
      "memory_size" : "2.1gb",
      "memory_size_in_bytes" : 2360492980,
      "evictions" : 0
    },
    "query_cache" : {
      "memory_size" : "33gb",
      "memory_size_in_bytes" : 35441901188,
      "total_count" : 3528593277,
      "hit_count" : 42556885,
      "miss_count" : 3486036392,
      "cache_size" : 266506,
      "cache_count" : 1024634,
      "evictions" : 758128
    },
    "completion" : {
      "size" : "0b",
      "size_in_bytes" : 0
    },
    "segments" : {
      "count" : 91019,
      "memory" : "2gb",
      "memory_in_bytes" : 2196122264,
      "terms_memory" : "1.6gb",
      "terms_memory_in_bytes" : 1769496376,
      "stored_fields_memory" : "151.4mb",
      "stored_fields_memory_in_bytes" : 158817912,
      "term_vectors_memory" : "0b",
      "term_vectors_memory_in_bytes" : 0,
      "norms_memory" : "34.8mb",
      "norms_memory_in_bytes" : 36535744,
      "points_memory" : "0b",
      "points_memory_in_bytes" : 0,
      "doc_values_memory" : "220.5mb",
      "doc_values_memory_in_bytes" : 231272232,
      "index_writer_memory" : "1.1gb",
      "index_writer_memory_in_bytes" : 1217068942,
      "version_map_memory" : "836.7kb",
      "version_map_memory_in_bytes" : 856803,
      "fixed_bit_set" : "2.2gb",
      "fixed_bit_set_memory_in_bytes" : 2430196608,
      "max_unsafe_auto_id_timestamp" : 1671527327933,
      "file_sizes" : { }
    },
    "mappings" : {
      "field_types" : [
        {
          "name" : "alias",
          "count" : 61,
          "index_count" : 16,
          "script_count" : 0
        },
        {
          "name" : "binary",
          "count" : 8,
          "index_count" : 8,
          "script_count" : 0
        },
        {
          "name" : "boolean",
          "count" : 1796,
          "index_count" : 242,
          "script_count" : 0
        },
        {
          "name" : "byte",
          "count" : 17,
          "index_count" : 17,
          "script_count" : 0
        },
        {
          "name" : "constant_keyword",
          "count" : 52,
          "index_count" : 18,
          "script_count" : 0
        },
        {
          "name" : "date",
          "count" : 7306,
          "index_count" : 2458,
          "script_count" : 0
        },
        {
          "name" : "date_nanos",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "date_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "double",
          "count" : 4319,
          "index_count" : 19,
          "script_count" : 0
        },
        {
          "name" : "double_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "flattened",
          "count" : 168,
          "index_count" : 14,
          "script_count" : 0
        },
        {
          "name" : "float",
          "count" : 4058,
          "index_count" : 310,
          "script_count" : 0
        },
        {
          "name" : "float_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "geo_point",
          "count" : 404,
          "index_count" : 175,
          "script_count" : 0
        },
        {
          "name" : "geo_shape",
          "count" : 5,
          "index_count" : 5,
          "script_count" : 0
        },
        {
          "name" : "half_float",
          "count" : 72,
          "index_count" : 16,
          "script_count" : 0
        },
        {
          "name" : "integer",
          "count" : 1389,
          "index_count" : 347,
          "script_count" : 0
        },
        {
          "name" : "integer_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "ip",
          "count" : 330,
          "index_count" : 21,
          "script_count" : 0
        },
        {
          "name" : "ip_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "keyword",
          "count" : 167941,
          "index_count" : 2354,
          "script_count" : 0
        },
        {
          "name" : "long",
          "count" : 50036,
          "index_count" : 1578,
          "script_count" : 0
        },
        {
          "name" : "long_range",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "match_only_text",
          "count" : 910,
          "index_count" : 14,
          "script_count" : 0
        },
        {
          "name" : "nested",
          "count" : 331,
          "index_count" : 60,
          "script_count" : 0
        },
        {
          "name" : "object",
          "count" : 80035,
          "index_count" : 674,
          "script_count" : 0
        },
        {
          "name" : "scaled_float",
          "count" : 2030,
          "index_count" : 14,
          "script_count" : 0
        },
        {
          "name" : "shape",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "short",
          "count" : 2,
          "index_count" : 2,
          "script_count" : 0
        },
        {
          "name" : "text",
          "count" : 57790,
          "index_count" : 1056,
          "script_count" : 0
        },
        {
          "name" : "unsigned_long",
          "count" : 76,
          "index_count" : 38,
          "script_count" : 0
        },
        {
          "name" : "version",
          "count" : 4,
          "index_count" : 4,
          "script_count" : 0
        },
        {
          "name" : "wildcard",
          "count" : 238,
          "index_count" : 14,
          "script_count" : 0
        }
      ],
      "runtime_field_types" : [
        {
          "name" : "keyword",
          "count" : 23,
          "index_count" : 12,
          "scriptless_count" : 23,
          "shadowed_count" : 23,
          "lang" : [ ],
          "lines_max" : 0,
          "lines_total" : 0,
          "chars_max" : 0,
          "chars_total" : 0,
          "source_max" : 0,
          "source_total" : 0,
          "doc_max" : 0,
          "doc_total" : 0
        }
      ]
    },
    "analysis" : {
      "char_filter_types" : [ ],
      "tokenizer_types" : [ ],
      "filter_types" : [ ],
      "analyzer_types" : [ ],
      "built_in_char_filters" : [ ],
      "built_in_tokenizers" : [ ],
      "built_in_filters" : [ ],
      "built_in_analyzers" : [ ]
    },
    "versions" : [
      {
        "version" : "6.4.2",
        "index_count" : 14,
        "primary_shard_count" : 14,
        "total_primary_size" : "177mb",
        "total_primary_bytes" : 185689169
      },
      {
        "version" : "6.8.5",
        "index_count" : 2,
        "primary_shard_count" : 2,
        "total_primary_size" : "44.3mb",
        "total_primary_bytes" : 46507523
      },
      {
        "version" : "7.4.2",
        "index_count" : 54,
        "primary_shard_count" : 54,
        "total_primary_size" : "2.5gb",
        "total_primary_bytes" : 2697595563
      },
      {
        "version" : "7.9.0",
        "index_count" : 392,
        "primary_shard_count" : 401,
        "total_primary_size" : "98.5gb",
        "total_primary_bytes" : 105801537104
      },
      {
        "version" : "7.13.3",
        "index_count" : 383,
        "primary_shard_count" : 403,
        "total_primary_size" : "116.1gb",
        "total_primary_bytes" : 124752442293
      },
      {
        "version" : "7.17.0",
        "index_count" : 1905,
        "primary_shard_count" : 3417,
        "total_primary_size" : "29.7tb",
        "total_primary_bytes" : 32678930501899
      }
    ]
  },
  "nodes" : {
    "count" : {
      "total" : 49,
      "coordinating_only" : 2,
      "data" : 0,
      "data_cold" : 0,
      "data_content" : 38,
      "data_frozen" : 0,
      "data_hot" : 38,
      "data_warm" : 4,
      "ingest" : 42,
      "master" : 3,
      "ml" : 44,
      "remote_cluster_client" : 2,
      "transform" : 42,
      "voting_only" : 0
    },
    "versions" : [
      "7.17.0"
    ],
    "os" : {
      "available_processors" : 356,
      "allocated_processors" : 356,
      "names" : [
        {
          "name" : "Linux",
          "count" : 49
        }
      ],
      "pretty_names" : [
        {
          "pretty_name" : "Ubuntu 20.04.3 LTS",
          "count" : 1
        },
        {
          "pretty_name" : "CentOS Linux 7 (Core)",
          "count" : 48
        }
      ],
      "architectures" : [
        {
          "arch" : "amd64",
          "count" : 49
        }
      ],
      "mem" : {
        "total" : "1.4tb",
        "total_in_bytes" : 1619387199488,
        "free" : "61gb",
        "free_in_bytes" : 65576976384,
        "used" : "1.4tb",
        "used_in_bytes" : 1553810223104,
        "free_percent" : 4,
        "used_percent" : 96
      }
    },
    "process" : {
      "cpu" : {
        "percent" : 601
      },
      "open_file_descriptors" : {
        "min" : 1331,
        "max" : 3635,
        "avg" : 2913
      }
    },
    "jvm" : {
      "max_uptime" : "319.6d",
      "max_uptime_in_millis" : 27620276206,
      "versions" : [
        {
          "version" : "17.0.1",
          "vm_name" : "OpenJDK 64-Bit Server VM",
          "vm_version" : "17.0.1+12",
          "vm_vendor" : "Eclipse Adoptium",
          "bundled_jdk" : true,
          "using_bundled_jdk" : true,
          "count" : 49
        }
      ],
      "mem" : {
        "heap_used" : "287.8gb",
        "heap_used_in_bytes" : 309076183984,
        "heap_max" : "672gb",
        "heap_max_in_bytes" : 721554505728
      },
      "threads" : 7517
    },
    "fs" : {
      "total" : "82tb",
      "total_in_bytes" : 90232745926656,
      "free" : "38tb",
      "free_in_bytes" : 41854273970176,
      "available" : "37.2tb",
      "available_in_bytes" : 41009207721984
    },
    "plugins" : [ ],
    "network_types" : {
      "transport_types" : {
        "security4" : 49
      },
      "http_types" : {
        "security4" : 49
      }
    },
    "discovery_types" : {
      "zen" : 49
    },
    "packaging_types" : [
      {
        "flavor" : "default",
        "type" : "rpm",
        "count" : 48
      },
      {
        "flavor" : "default",
        "type" : "docker",
        "count" : 1
      }
    ],
    "ingest" : {
      "number_of_pipelines" : 23,
      "processor_stats" : {
        "conditional" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "convert" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "geoip" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "grok" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "gsub" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "remove" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "rename" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "script" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "set" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        },
        "set_security_user" : {
          "count" : 0,
          "failed" : 0,
          "current" : 0,
          "time" : "0s",
          "time_in_millis" : 0
        }
      }
    }
  }
}

I temporarily resolved the issue again by removing the nodes and adding them back to the cluster (within a few hours).

If I look at the allocation now, we can see that shards are evenly distributed among the nodes, but disk usage is not balanced: some nodes are 32% utilised while others are at more than 80%.

GET _cat/allocation?v&h=node,shards,disk.indices,disk.used, disk.avail,disk.total,disk.percent&s=disk.percent
node         shards disk.indices disk.used disk.total disk.percent
tela33prahkz    197      656.9gb   661.2gb      1.9tb           32
tela07prahkz    197      673.8gb     765gb      1.9tb           38
tela34prahkz    197      793.3gb   797.8gb      1.9tb           39
tela32prahkz    197      818.6gb   823.9gb      1.9tb           40
tela14prahkz    197      816.1gb   820.4gb      1.9tb           41
tela18prahkz    197        813gb     817gb      1.9tb           41
tela13prahkz    197      843.5gb     848gb      1.9tb           42
tela35prahkz    197        888gb   891.9gb      1.9tb           43
tela30prahkz    197      951.3gb   955.5gb      1.9tb           47
tela03prahkz    197      839.3gb   930.7gb      1.9tb           47
tela31prahkz    197      996.4gb  1000.6gb      1.9tb           49
tela21prahkz    198      983.3gb   989.6gb      1.9tb           49
tela22prahkz    197     1007.8gb  1013.1gb      1.9tb           51
tela08prahkz    197      953.2gb       1tb      1.9tb           53
tela37prahkz    198          1tb       1tb      1.9tb           54
tela23prahkz    197          1tb       1tb      1.9tb           54
tela25prahkz    197          1tb       1tb      1.9tb           54
tela24prahkz    197          1tb       1tb      1.9tb           56
tela06prahkz    197     1014.6gb       1tb      1.9tb           56
tela09prahkz    198          1tb     1.1tb      1.9tb           60
tela10prahkz    198        1.1tb     1.1tb      1.9tb           61
tela17prahkz    198        1.1tb     1.1tb      1.9tb           61
tela28prahkz    197        1.2tb     1.2tb      1.9tb           62
tela26prahkz    198        1.2tb     1.2tb      1.9tb           62
tela01prahkz    198        1.1tb     1.2tb      1.9tb           64
tela12prahkz    197        1.3tb     1.3tb      1.9tb           67
tela05prahkz    198        1.2tb     1.3tb      1.9tb           68
tela36prahkz    198        1.3tb     1.3tb      1.9tb           68
tela15prahkz    198        1.3tb     1.3tb      1.9tb           69
tela29prahkz    198        1.3tb     1.3tb      1.9tb           72
tela27prahkz    198        1.4tb     1.4tb      1.9tb           73
tela38prahkz    198        1.4tb     1.4tb      1.9tb           75
tela16prahkz    198        1.4tb     1.4tb      1.9tb           76
tela02prahkz    198        1.3tb     1.4tb      1.9tb           77
tela04prahkz    198        1.4tb     1.5tb      1.9tb           78
tela20prahkz    198        1.5tb     1.5tb      1.9tb           78
tela19prahkz    198        1.5tb     1.5tb      1.9tb           80
tela11prahkz    197        1.5tb     1.5tb      1.9tb           81
