New warm nodes are getting filled fast

Hi All,

In the beginning, we had 4 data nodes: two were ILM-designated hot nodes and the other two were warm nodes. Newly created indices would live on the hot nodes during the hot phase and, after 4 weeks of retention, would move to the warm nodes for the warm phase.

Recently we added two more warm nodes to the cluster. But what we have been noticing is that indices are predominantly populating these new warm nodes instead of being evenly distributed across all four warm nodes. How can I stop the indices from piling up only on the new warm nodes?

Can you share your ILM policy and the config from the warm nodes?

We don't have an ILM policy set up as of now, because the naming for the majority of our indices is static rather than dynamic. So we manually run the command

PUT *2021.<week_number>/_settings
{
  "index.routing.allocation.require.data": "warm"
}

every week to move indices from the hot phase to the warm phase.
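
(For reference, the same hot-to-warm movement expressed as an ILM policy would look roughly like the sketch below; the policy name and the 28d min_age are placeholders matching the 4-week retention described above.)

PUT _ilm/policy/weekly-hot-warm
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {}
      },
      "warm": {
        "min_age": "28d",
        "actions": {
          "allocate": {
            "require": {
              "data": "warm"
            }
          }
        }
      }
    }
  }
}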

Following is the config of our new warm node:

cluster.name: "**"
node.name: "data-4"
path.logs: /var/log/elasticsearch
path.data: /datadisks/disk2/elasticsearch/data
discovery.zen.ping.unicast.hosts: ["master-0:9300","master-1:9300","master-2:9300"]
node.master: false
node.data: true
node.attr.data: warm
discovery.zen.minimum_master_nodes: 2
network.host: [_site_, _local_]
node.max_local_storage_nodes: 1
#node.attr.fault_domain:
#node.attr.update_domain:
#cluster.routing.allocation.awareness.attributes: fault_domain,update_domain
#xpack.license.self_generated.type: trial
xpack.security.enabled: true
bootstrap.memory_lock: false

xpack.security.authc.token.enabled: true

xpack.security.authc.realms.native1:
  type: native
  order: 0

xpack.security.authc.realms.saml1:
  type: saml
  order: 2
  idp.metadata.path: saml/idp-external.xml
  idp.entity_id: "^^^"
  sp.entity_id: ""
  sp.acs: "^^^"
  sp.logout: ""
  attributes.principal: "***"
  attributes.groups: "http://schemas.microsoft.com/ws/2008/06/identity/claims/groups"

xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.key: ssl/data-4.key
xpack.security.http.ssl.certificate: ssl/data-4.crt
#xpack.security.http.ssl.key_passphrase: ***

xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: certs/elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-certificates.p12
xpack.notification.email.account:
  standard_account:
    profile: standard
    smtp:
      auth: false
      starttls.enable: false

It looks like you have a license above Basic, based on your use of Watcher and AD in Security, so I would encourage you to reach out to your Support contact about this as well.

What's the output from:

GET /_cat/allocation?v
GET /_cat/nodeattrs?v
shards disk.indices disk.used disk.avail disk.total disk.percent host        ip          node
   830          1tb     1.1tb    804.6gb      1.9tb           60 10.40.10.58 10.40.10.58 cle-data-3
   830          1tb     1.1tb    798.2gb      1.9tb           60 10.40.10.57 10.40.10.57 cle-data-2
   830        1.4tb     1.5tb    383.6gb      1.9tb           80 10.40.10.59 10.40.10.59 cle-data-5
  1005        1.3tb     1.4tb    482.7gb      1.9tb           76 10.40.10.8  10.40.10.8  cle-data-0
   830        1.5tb     1.6tb    344.5gb      1.9tb           82 10.40.10.60 10.40.10.60 cle-data-4
  1005        1.3tb     1.5tb    479.5gb      1.9tb           76 10.40.10.6  10.40.10.6  cle-data-1

cle-master-2 ml.machine_memory 3608973312
cle-master-2 ml.max_open_jobs  20
cle-master-2 xpack.installed   true
cle-master-2 ml.enabled        true
cle-data-2   ml.machine_memory 14706561024
cle-data-2   ml.max_open_jobs  20
cle-data-2   xpack.installed   true
cle-data-2   ml.enabled        true
cle-data-2   data              warm
cle-data-1   ml.machine_memory 59094614016
cle-data-1   ml.max_open_jobs  20
cle-data-1   xpack.installed   true
cle-data-1   ml.enabled        true
cle-data-1   data              hot
cle-data-3   ml.machine_memory 14706561024
cle-data-3   ml.max_open_jobs  20
cle-data-3   xpack.installed   true
cle-data-3   ml.enabled        true
cle-data-3   data              warm
cle-data-4   ml.machine_memory 14677934080
cle-data-4   ml.max_open_jobs  20
cle-data-4   xpack.installed   true
cle-data-4   ml.enabled        true
cle-data-4   data              warm
cle-data-5   ml.machine_memory 14677934080
cle-data-5   ml.max_open_jobs  20
cle-data-5   xpack.installed   true
cle-data-5   ml.enabled        true
cle-data-5   data              warm
cle-master-0 ml.machine_memory 3608965120
cle-master-0 ml.max_open_jobs  20
cle-master-0 xpack.installed   true
cle-master-0 ml.enabled        true
cle-master-1 ml.machine_memory 3608973312
cle-master-1 ml.max_open_jobs  20
cle-master-1 xpack.installed   true
cle-master-1 ml.enabled        true
cle-data-0   ml.machine_memory 59094618112
cle-data-0   ml.max_open_jobs  20
cle-data-0   xpack.installed   true
cle-data-0   ml.enabled        true
cle-data-0   data              hot

Just a note: you have way too many shards for your data volume and are likely overloading your nodes. You should shrink some of your indices if you can.
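
(If it helps, this is done with the Shrink API; the index name, target node, and shard counts below are just examples. The index has to be made read-only and fully allocated onto one node before shrinking:)

PUT logs-2021.30/_settings
{
  "index.blocks.write": true,
  "index.routing.allocation.require._name": "cle-data-2"
}

POST logs-2021.30/_shrink/logs-2021.30-shrunk
{
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}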

However, looking at that output I can see:

cle-data-2   data              warm
cle-data-3   data              warm
cle-data-4   data              warm
cle-data-5   data              warm

And:

shards disk.indices disk.used disk.avail disk.total disk.percent host        ip          node
   830          1tb     1.1tb    804.6gb      1.9tb           60 10.40.10.58 10.40.10.58 cle-data-3
   830          1tb     1.1tb    798.2gb      1.9tb           60 10.40.10.57 10.40.10.57 cle-data-2
   830        1.4tb     1.5tb    383.6gb      1.9tb           80 10.40.10.59 10.40.10.59 cle-data-5
   830        1.5tb     1.6tb    344.5gb      1.9tb           82 10.40.10.60 10.40.10.60 cle-data-4

So all of those warm nodes have the same shard count (830). It does look like the shard sizes are different though, which would account for what you are seeing.

However, Elasticsearch balances by shard count first; only when a node starts to hit the disk watermarks will it move shards around as needed.
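
(As a quick check, the effective disk watermark settings can be pulled from the cluster settings; the filter_path here is just one way to narrow the output. By default the low and high watermarks are 85% and 90% of disk used.)

GET _cluster/settings?include_defaults=true&filter_path=*.cluster.routing.allocation.disk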