ILMs in cluster not performing rollover when conditions are met

Hello! We have an Elasticsearch cluster running version 8.19.3 (master, hot, cold and coordinating nodes) and we're seeing some issues with our ILM policies. We are using data streams, so we've configured several ILM policies to handle data rotation. In the hot phase we explicitly configure a rollover when a data stream reaches 50gb in size or 30 days of age. The problem is that Elasticsearch does not seem to do this about half of the time. We see this for all of our ILM policies; I'll leave an example below.

This is a shard that’s over 50gb:

# GET _cat/shards/.ds*?v&s=store,store:desc

...
.ds-logstash-$some-index-2025.11.10-000026                                                               2     r      STARTED     28659163   56.4gb   56.4gb $ip_address $elasticsearch_node

We have cases where shards grow past 70gb or even 100gb and still do not get rolled over. I do not have a bigger example at the moment because I had to roll them over by hand yesterday, since they were causing performance issues.
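In case it matters, by "by hand" I mean something along the lines of the standard rollover request against the data stream itself (no conditions in the body, so it rolls over immediately):

# Roll the data stream over immediately (no conditions in the body)
POST logstash-$some-index/_rollover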

These are the related ILM and template configurations for that index; nothing seems odd:

# GET .ds-logstash-$some-index-2025.11.10-000026/_ilm/explain

{
  "indices": {
    ".ds-logstash-$some-index-2025.11.10-000026": {
      "index": ".ds-logstash-$some-index-2025.11.10-000026",
      "managed": true,
      "policy": "7warm15cold40delete",
      "index_creation_date_millis": 1762815974348,
      "time_since_index_creation": "15.97h",
      "lifecycle_date_millis": 1762815974348,
      "age": "15.97h",
      "phase": "hot",
      "phase_time_millis": 1762815974461,
      "action": "rollover",
      "action_time_millis": 1762815975461,
      "step": "check-rollover-ready",
      "step_time_millis": 1762815975461,
      "phase_execution": {
        "policy": "7warm15cold40delete",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "min_docs": 1,
              "max_primary_shard_docs": 200000000,
              "max_primary_shard_size": "50gb"
            },
            "set_priority": {
              "priority": 100
            }
          }
        },
        "version": 1,
        "modified_date_in_millis": 1759800628187
      },
      "skip": false
    }
  }
}
# GET _ilm/policy/7warm15cold40delete

{
  "7warm15cold40delete": {
    "version": 1,
    "modified_date": "2025-10-07T01:30:28.187Z",
    "policy": {
      "phases": {
        "cold": {
          "min_age": "15d",
          "actions": {
            "set_priority": {
              "priority": 0
            }
          }
        },
        "warm": {
          "min_age": "7d",
          "actions": {
            "allocate": {
              "number_of_replicas": 1,
              "include": {},
              "exclude": {},
              "require": {}
            },
            "forcemerge": {
              "max_num_segments": 1
            },
            "readonly": {},
            "set_priority": {
              "priority": 50
            }
          }
        },
        "hot": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "max_primary_shard_size": "50gb"
            },
            "set_priority": {
              "priority": 100
            }
          }
        },
        "delete": {
          "min_age": "40d",
          "actions": {
            "delete": {
              "delete_searchable_snapshot": true
            }
          }
        }
      }
    },
    "in_use_by": {
      "indices": [
	...
        ".ds-logstash-$some-index-2025.11.10-000026",
        ...
      ],
      "data_streams": [
	...
        "logstash-$some-index"
      ],
      "composable_templates": [
        ...
        "$index-template",
        ...
      ]
    }
  }
}

And the associated template:

{
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "7warm15cold40delete"
        },
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_hot"
            }
          }
        },
        "mapping": {
          "total_fields": {
            "limit": "2000"
          }
        },
        "number_of_shards": "5",
        "number_of_replicas": "1"
      }
    },
    "mappings": {
      "_data_stream_timestamp": {
        "enabled": true
      },
      "properties": {
        "@timestamp": {
          "type": "date"
        }
      }
    },
    "aliases": {}
  }
}

I had to obfuscate some information such as index and template names, but the settings are all visible. We have several of these policies configured the same way; we only change the retention days in each one.

Please note that we've only recently started using version 8, so I checked the docs in case I was configuring something that was deprecated in version 7 or similar, but I don't see the problem.

Is there something we're configuring wrong, or is something malfunctioning? We know that shards sometimes go a little over 50gb before they roll over; that's fine. But we have shards reaching 100gb while ILM reports `"step": "check-rollover-ready"`, and about half of the time it does not perform the rollover even though the conditions are met. I appreciate any help you can provide!
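One more thing I can check on our side, in case it is relevant: as far as I understand, ILM only evaluates rollover conditions periodically rather than continuously, so some overshoot past 50gb between checks would be expected. This is just a sketch of how I'd verify the poll interval (it should default to 10m if we never changed it):

# Show the ILM poll interval from persistent/transient settings or the defaults
GET _cluster/settings?include_defaults=true&filter_path=*.indices.lifecycle.poll_interval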

Hello @Natalia_Mellino

To troubleshoot this issue we need to see all the shards for this index, so the command to use is:

GET _cat/shards/.ds-logstash-$some-index-2025*?v&s=store,store:desc

This index has 5 primary shards and 1 replica per primary as per the index template, so the command above should return 10 rows.
We need to know the size of each primary shard (ideally it will be the same as its replica): what was the size of each primary shard for this index at the time of the issue?
Also, as per your description, can you confirm that rollover sometimes works for this index (or for the other indices, since you said it affects all of them) and sometimes does not?
Can you also check the master node logs to see if there are any messages related to ILM?
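For example, grepping the elected master's logs for `ILM`, `rollover`, or the policy name `7warm15cold40delete` should narrow it down. On the API side, a quick sanity check could look something like this (just a sketch; adjust the index pattern to your data streams):

# Confirm the ILM service is actually RUNNING
GET _ilm/status

# List only managed indices that are currently stuck in an ILM error step
GET .ds-logstash-*/_ilm/explain?only_errors=true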

Thanks!!

Hello! Sorry for the delay, and thank you for your help. I've just run the request you asked for, and I see something very different from what I saw when I wrote the post. See:

index                                             shard prirep state          docs  store dataset ip          node
.ds-logstash-$some-index-2025.11.10-000026    1     p      STARTED    42494554 50.2gb  50.2gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    0     r      STARTED    42486821 50.6gb  50.6gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    2     p      STARTED    42498245 50.2gb  50.2gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    0     p      STARTED    42486821 50.2gb  50.2gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    1     r      STARTED    42494554 50.6gb  50.6gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    3     p      STARTED    42494553 50.3gb  50.3gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    3     r      STARTED    42494553 50.3gb  50.3gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    4     p      STARTED    42484812 50.3gb  50.3gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    2     r      STARTED    42498245 50.2gb  50.2gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    4     r      STARTED    42484812 50.3gb  50.3gb     $ip         $node

Now all shards are around 50gb, which seems OK, but a few days ago one of the shards showed 56gb, as in my first post:

# GET _cat/shards/.ds*?v&s=store,store:desc

...
.ds-logstash-$some-index-2025.11.10-000026                                                               2     r      STARTED     28659163   56.4gb   56.4gb $ip_address $elasticsearch_node

That is really weird: it is the same shard (same index, of course), but the size reported by the requests keeps varying, and now they all seem fine. This is not the only example anyway; I think I can get another one if this one does not give enough information, because it does not happen only with this index.

I can confirm that sometimes it works and sometimes it doesn't (roughly 50/50).

As for the master nodes, should I look or filter for anything in particular? This is a big cluster and we have a lot of logs.

I am seeing the same behaviour for other indices as well. The request

GET _cat/shards/.ds*?v&s=store,store:desc

shows one size, but when I filter for a specific index:

GET _cat/shards/.ds-logstash-$some-index-2025*?v&s=store,store:desc

to see all of its shards, I see different sizes for the same shards (as happened in the response above). Now I don't know which query to trust to tell whether this is actually a problem or not.
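To compare the two outputs more precisely, I suppose I can also ask cat for exact byte counts instead of rounded values, something like:

# Same shard listing, but with exact byte counts and explicit columns
GET _cat/shards/.ds-logstash-$some-index-2025*?v&s=store:desc&h=index,shard,prirep,state,docs,store&bytes=b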

All segments in Elasticsearch are immutable, so when merging takes place new merged segments need to be created before the old, redundant ones can be deleted. This will lead to the reported shard size fluctuating over time, as you captured here. I believe ILM does not just look at the raw current shard size (which would frequently be affected by merging) but instead estimates the merged size.
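One rough way to observe this is to list the segments behind the shard and watch them change: right after a large merge you should see fewer, bigger segments and the reported store size drop back down. For example (using the index name from your output):

# List the segments backing each shard of this backing index
GET _cat/segments/.ds-logstash-$some-index-2025.11.10-000026?v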

Do you mean different sizes when comparing queries made 1 second apart, or different values when comparing queries made 1 week apart?

I did not know that, thank you for the answer. Is there by any chance some documentation that talks about this?

Both, actually. From what I saw today, the sizes differ between the two queries:

  • If the two queries are made a week apart, the shard sizes reported are different.
  • If the two queries are made one second apart, I see a slight difference in sizes, but it is very small; they are almost the same (± 1/2gb).
  • I even made the two queries a few minutes apart and the result changed significantly (around a 10gb difference).

Could it be related to what Christian said, with ES estimating the size in one (or both) of the requests? ILMs in cluster not performing rollover when conditions are met - #5 by Christian_Dahlqvist
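As an extra cross-check, I can also pull the per-shard store size in exact bytes from the index stats API at the same moment and compare it with the cat output; just a sketch:

# Per-shard store statistics (exact byte counts) for the backing index
GET .ds-logstash-$some-index-2025.11.10-000026/_stats/store?level=shards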

I have not found this in the documentation, so it is based on an old memory from around the time rollover was first introduced. As shown in this blog, rollover (which I believe ILM uses) did not initially support specifying a maximum size at all, as merging would make this very unreliable. I believe this issue discusses the implementation of the size check that I remember. I am not sure if/when this was implemented, nor whether it is actually in place.

Thank you for your help, I definitely didn't have this information. I think I'll close this for now since I can't accurately tell whether we actually have this problem or not. I'll reopen or create another topic if I find anything else. Thank you all <3
