ILMs in cluster not performing rollover when conditions are met

Hello! We have an Elasticsearch cluster running version 8.19.3 (master, hot, cold and coordinating nodes) and we're seeing some issues with our ILM policies. We are using data streams, so we've configured several ILM policies to handle data rotation. In the hot phase we explicitly configure a rollover when a data stream reaches 50gb in size or 30 days of age. The problem is that Elasticsearch does not seem to do this about half of the time. We see this for all of our ILM policies; I'll leave an example below.

This is a shard that’s over 50gb:

# GET _cat/shards/.ds*?v&s=store,store:desc

...
.ds-logstash-$some-index-2025.11.10-000026                                                               2     r      STARTED     28659163   56.4gb   56.4gb $ip_address $elasticsearch_node

We have cases where shards grow past 70gb or even 100gb and still do not get rolled over. I do not have a bigger example at the moment because I had to roll them over by hand yesterday, since they were causing performance issues.
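In case it matters, by "by hand" I mean something along the lines of the standard rollover request against the data stream itself (no conditions in the body, so it rolls over immediately):

# Roll the data stream over immediately (no conditions in the body)
POST logstash-$some-index/_rollover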

These are the related ILM and template configurations for that index; nothing seems odd:

# GET .ds-logstash-$some-index-2025.11.10-000026/_ilm/explain

{
  "indices": {
    ".ds-logstash-$some-index-2025.11.10-000026": {
      "index": ".ds-logstash-$some-index-2025.11.10-000026",
      "managed": true,
      "policy": "7warm15cold40delete",
      "index_creation_date_millis": 1762815974348,
      "time_since_index_creation": "15.97h",
      "lifecycle_date_millis": 1762815974348,
      "age": "15.97h",
      "phase": "hot",
      "phase_time_millis": 1762815974461,
      "action": "rollover",
      "action_time_millis": 1762815975461,
      "step": "check-rollover-ready",
      "step_time_millis": 1762815975461,
      "phase_execution": {
        "policy": "7warm15cold40delete",
        "phase_definition": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "min_docs": 1,
              "max_primary_shard_docs": 200000000,
              "max_primary_shard_size": "50gb"
            },
            "set_priority": {
              "priority": 100
            }
          }
        },
        "version": 1,
        "modified_date_in_millis": 1759800628187
      },
      "skip": false
    }
  }
}
# GET _ilm/policy/7warm15cold40delete

{
  "7warm15cold40delete": {
    "version": 1,
    "modified_date": "2025-10-07T01:30:28.187Z",
    "policy": {
      "phases": {
        "cold": {
          "min_age": "15d",
          "actions": {
            "set_priority": {
              "priority": 0
            }
          }
        },
        "warm": {
          "min_age": "7d",
          "actions": {
            "allocate": {
              "number_of_replicas": 1,
              "include": {},
              "exclude": {},
              "require": {}
            },
            "forcemerge": {
              "max_num_segments": 1
            },
            "readonly": {},
            "set_priority": {
              "priority": 50
            }
          }
        },
        "hot": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "max_primary_shard_size": "50gb"
            },
            "set_priority": {
              "priority": 100
            }
          }
        },
        "delete": {
          "min_age": "40d",
          "actions": {
            "delete": {
              "delete_searchable_snapshot": true
            }
          }
        }
      }
    },
    "in_use_by": {
      "indices": [
	...
        ".ds-logstash-$some-index-2025.11.10-000026",
        ...
      ],
      "data_streams": [
	...
        "logstash-$some-index"
      ],
      "composable_templates": [
        ...
        "$index-template",
        ...
      ]
    }
  }
}

And the associated template:

{
  "template": {
    "settings": {
      "index": {
        "lifecycle": {
          "name": "7warm15cold40delete"
        },
        "routing": {
          "allocation": {
            "include": {
              "_tier_preference": "data_hot"
            }
          }
        },
        "mapping": {
          "total_fields": {
            "limit": "2000"
          }
        },
        "number_of_shards": "5",
        "number_of_replicas": "1"
      }
    },
    "mappings": {
      "_data_stream_timestamp": {
        "enabled": true
      },
      "properties": {
        "@timestamp": {
          "type": "date"
        }
      }
    },
    "aliases": {}
  }
}

I had to obfuscate some information such as index and template names, but the settings are all visible. We have several of these policies configured the same way; we only change the retention days in each one.

Please note that we've only recently started using version 8, so I checked the docs in case I was configuring something that was deprecated in version 7 or similar, but I don't see the problem.

Is there something we're configuring wrong, or is something malfunctioning? We know that shards sometimes go a little over 50gb before they roll over; that's fine. But we have shards reaching 100gb while ILM reports `"step": "check-rollover-ready"`, and about half of the time it does not perform the rollover even though the conditions are met. I appreciate any help you can provide!
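One more thing I can check on our side, in case it is relevant: as far as I understand, ILM only evaluates rollover conditions periodically rather than continuously, so some overshoot past 50gb between checks would be expected. This is just a sketch of how I'd verify the poll interval (it should default to 10m if we never changed it):

# Show the ILM poll interval from persistent/transient settings or the defaults
GET _cluster/settings?include_defaults=true&filter_path=*.indices.lifecycle.poll_interval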

Hello @Natalia_Mellino

To troubleshoot this issue we need to see all the shards for this index, so the command to use is:

GET _cat/shards/.ds-logstash-$some-index-2025*?v&s=store,store:desc

This index has 5 primary shards and 1 replica per primary as per the index template, so the command above should return 10 rows.
We need to know the size of each primary shard (ideally it will be the same as its replica): what was the size of each primary shard for this index at the time of the issue?
Also, as per your description, can you confirm that rollover sometimes works for this index (or for the other indices, since you said it affects all of them) and sometimes does not?
Can you also check the master node logs to see if there are any messages related to ILM?
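For example, grepping the elected master's logs for `ILM`, `rollover`, or the policy name `7warm15cold40delete` should narrow it down. On the API side, a quick sanity check could look something like this (just a sketch; adjust the index pattern to your data streams):

# Confirm the ILM service is actually RUNNING
GET _ilm/status

# List only managed indices that are currently stuck in an ILM error step
GET .ds-logstash-*/_ilm/explain?only_errors=true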

Thanks!!

Hello! Sorry for the delay, and thank you for your help. I've just run the request you asked for, and I see something very different from what I saw when I wrote the post. See:

index                                             shard prirep state          docs  store dataset ip          node
.ds-logstash-$some-index-2025.11.10-000026    1     p      STARTED    42494554 50.2gb  50.2gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    0     r      STARTED    42486821 50.6gb  50.6gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    2     p      STARTED    42498245 50.2gb  50.2gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    0     p      STARTED    42486821 50.2gb  50.2gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    1     r      STARTED    42494554 50.6gb  50.6gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    3     p      STARTED    42494553 50.3gb  50.3gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    3     r      STARTED    42494553 50.3gb  50.3gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    4     p      STARTED    42484812 50.3gb  50.3gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    2     r      STARTED    42498245 50.2gb  50.2gb     $ip         $node
.ds-logstash-$some-index-2025.11.10-000026    4     r      STARTED    42484812 50.3gb  50.3gb     $ip         $node

Now all shards are around 50gb, which seems OK, but a few days ago one of the shards showed 56gb, as in my first post:

# GET _cat/shards/.ds*?v&s=store,store:desc

...
.ds-logstash-$some-index-2025.11.10-000026                                                               2     r      STARTED     28659163   56.4gb   56.4gb $ip_address $elasticsearch_node

That is really weird: it is the same shard (same index, of course), but the size reported by the requests keeps varying, and now they all seem fine. This is not the only example anyway; I think I can get another one if this one does not give enough information, because it does not happen only with this index.

I can confirm that sometimes it works and sometimes it doesn't (roughly 50/50).

As for the master nodes, should I look or filter for anything in particular? This is a big cluster and we have a lot of logs.

I am seeing the same behaviour for other indices as well. The request

GET _cat/shards/.ds*?v&s=store,store:desc

shows one size, but when I filter for a specific index:

GET _cat/shards/.ds-logstash-$some-index-2025*?v&s=store,store:desc

to see all of its shards, I see different sizes for the same shards (as happened in the response above). Now I don't know which query to trust to tell whether this is actually a problem or not.
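To compare the two outputs more precisely, I suppose I can also ask cat for exact byte counts instead of rounded values, something like:

# Same shard listing, but with exact byte counts and explicit columns
GET _cat/shards/.ds-logstash-$some-index-2025*?v&s=store:desc&h=index,shard,prirep,state,docs,store&bytes=b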

All segments in Elasticsearch are immutable, so when merging takes place new merged segments need to be created before the old, redundant ones can be deleted. This will lead to the reported shard size fluctuating over time, as you captured here. I believe ILM does not just look at the raw current shard size (which would frequently be affected by merging) but instead estimates the merged size.
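One rough way to observe this is to list the segments behind the shard and watch them change: right after a large merge you should see fewer, bigger segments and the reported store size drop back down. For example (using the index name from your output):

# List the segments backing each shard of this backing index
GET _cat/segments/.ds-logstash-$some-index-2025.11.10-000026?v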

Do you mean different sizes when comparing queries made 1 second apart, or different values when comparing queries made 1 week apart?

I did not know that, thank you for the answer. Is there by any chance some documentation that talks about this?

Both, actually. From what I saw today, the sizes differ between the two queries:

  • If the two queries are made a week apart, the shard sizes reported are different.
  • If the two queries are made one second apart, I see a slight difference in sizes, but it is very small; they are almost the same (± 1/2gb).
  • I even made the two queries a few minutes apart and the result changed significantly (around a 10gb difference).

Could it be related to what Christian said, with ES estimating the size in one (or both) of the requests? ILMs in cluster not performing rollover when conditions are met - #5 by Christian_Dahlqvist
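As an extra cross-check, I can also pull the per-shard store size in exact bytes from the index stats API at the same moment and compare it with the cat output; just a sketch:

# Per-shard store statistics (exact byte counts) for the backing index
GET .ds-logstash-$some-index-2025.11.10-000026/_stats/store?level=shards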

I have not found this in the documentation, so it is based on an old memory from around the time rollover was first introduced. As shown in this blog, rollover (which I believe ILM uses) did not initially support specifying a maximum size at all, as merging would make this very unreliable. I believe this issue discusses the implementation of the size check that I remember. I am not sure if/when this was implemented, nor whether it is actually in place.

Thank you for your help, I definitely didn't have this information. I think I'll close this for now since I can't accurately tell whether we actually have this problem or not. I'll reopen or create another topic if I find anything else. Thank you all <3
