ILM warm shrunken indices vs delete phase

I'm running a v8.15.0 cluster with 3 data nodes and have a 30-day ILM policy like this:

{
  "30-days-default": {
    "version": 3,
    "modified_date": "2023-03-17T23:13:14.838Z",
    "policy": {
      "phases": {
        "delete": {
          "min_age": "30d",
          "actions": {
            "delete": {
              "delete_searchable_snapshot": true
            }
          }
        },
        "warm": {
          "min_age": "7d",
          "actions": {
            "forcemerge": {
              "max_num_segments": 1
            },
            "shrink": {
              "number_of_shards": 1,
              "allow_write_after_shrink": false
            }
          }
        },
        "hot": {
          "min_age": "0ms",
          "actions": {
            "rollover": {
              "max_age": "30d",
              "max_primary_shard_size": "50gb"
            }
          }
        }
      },
      "_meta": {
        "description": "built-in ILM policy using the hot and warm phases with a retention of 30 days",
        "managed": true
      }
    },
    "in_use_by": {...}
  }
}

It seems to have worked fine in the past, but today it somehow failed to trim a large index in the warm phase, so I now have an alert due to a single large shard.

How best to deal with this?

I'm wondering whether such large indices shouldn't be shrunk to just 1 primary shard, or should maybe be split on size before reaching 50 GB.

Correct me if I'm wrong or misinterpreting, but to me it seems to be following the policy as you have specified it. In the hot phase you say a maximum of 30 days or a primary shard size of 50 GB. Looking at the output, it was rolled over on the age condition, and the shard is less than 50 GB. The size reported is the total size of the index, so you get over 50 GB for the primary and replica combined, but the shard itself is less than 50 GB. To me it seems to be acting according to the ILM policy as set up (which might in turn not be exactly what you require, of course).
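To make the size arithmetic concrete, here is a minimal Python sketch. The numbers are illustrative assumptions (50 GB reported in total, one primary, one replica), not values read from this cluster, and `primary_shard_size_gb` is a hypothetical helper.

```python
def primary_shard_size_gb(total_store_gb: float, primaries: int, replicas: int) -> float:
    """Approximate size of one primary shard, assuming all shard copies are equal in size."""
    copies = primaries * (1 + replicas)  # each primary plus its replica copies
    return total_store_gb / copies


# Assumed example: a shrunken index with 1 primary and 1 replica, 50 GB in total.
size = primary_shard_size_gb(total_store_gb=50.0, primaries=1, replicas=1)
print(f"{size:.1f} GB per primary shard")
```

On a live cluster, `GET _cat/indices?v&h=index,pri,rep,store.size,pri.store.size` shows the total and primary-only sizes side by side, which should make it clear which figure the alert is looking at.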

Sorry, yes, but I suspect this warm shard holds too much data. I expected data to move from warm to the delete phase after 30+ days, but I see what is IMHO way too old data in this index:

Even though 25G for a primary shard is in line with the recommendation from the blog on shard size and numbers (20-40G), it still alerts on its size...

Yeah, so one thing to keep in mind is that if you're using rollover, things get a bit complicated. min_age is normally measured from when the index/ds was created, but once an index has rolled over it is measured from the time of rollover instead. So, for example, if an index stays hot for 30 days and then rolls over, its age starts again from 0 at that point, and the 7 days for warm are counted from the rollover. More details can be found
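As a rough sketch of what this means for the policy above: assuming the index rolls over on `max_age` rather than on size, and that no documents arrive late, the oldest document can be considerably older than the phase `min_age` suggests. `parse_days` is a hypothetical helper that only handles the `d` unit.

```python
from datetime import timedelta


def parse_days(value: str) -> timedelta:
    """Parse a duration such as "30d" (only the day unit is handled in this sketch)."""
    if not value.endswith("d"):
        raise ValueError(f"unsupported duration: {value!r}")
    return timedelta(days=int(value[:-1]))


# Values taken from the 30-days-default policy above.
rollover_max_age = parse_days("30d")
warm_min_age = parse_days("7d")
delete_min_age = parse_days("30d")

# min_age counts from rollover, so the oldest document in an index that
# rolled over on age can be this old when each phase starts:
oldest_doc_at_warm = rollover_max_age + warm_min_age
oldest_doc_at_delete = rollover_max_age + delete_min_age

print(oldest_doc_at_warm.days, oldest_doc_at_delete.days)
```

Under those assumptions, data written on the first day of a backing index is only removed about 60 days later, not 30.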

Additionally, as I think you already know, it is worth noting that the data in the index and the dates that ILM uses don't necessarily have anything to do with each other. You could for various reasons have data in an index that is older (or newer) than expected, but ILM never acts on the actual data inside the index.

No, but since it's purely time-series log data, there ought to be some level of correlation between index creation/rollover dates and data timestamps :slight_smile:

Thus I find it a bit weird that the index in question explains its ILM state like this:

{
  "indices": {
    "shrink-bif1-.ds-epj_camel_logs-2024.07.19-000046": {
      "index": "shrink-bif1-.ds-epj_camel_logs-2024.07.19-000046",
      "managed": true,
      "policy": "30-days-default",
      "index_creation_date_millis": 1724590269676,
      "time_since_index_creation": "14.99d",
      "lifecycle_date_millis": 1723983321520,
      "age": "22.01d",
      "phase": "warm",
      "phase_time_millis": 1724588721509,
      "action": "complete",
      "action_time_millis": 1724595321501,
      "step": "complete",
      "step_time_millis": 1724595321501,
      "shrink_index_name": "shrink-bif1-.ds-epj_camel_logs-2024.07.19-000046",
      "phase_execution": {
        "policy": "30-days-default",
        "phase_definition": {
          "min_age": "7d",
          "actions": {
            "shrink": {
              "number_of_shards": 1,
              "allow_write_after_shrink": false
            },
            "forcemerge": {
              "max_num_segments": 1
            }
          }
        },
        "version": 3,
        "modified_date_in_millis": 1679094794838
      }
    }
  }
}
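For reference, the millisecond timestamps in this explain output can be converted to dates with a few lines of Python. This is only a sketch: it assumes `lifecycle_date_millis` is the rollover time of the original backing index, that `index_creation_date_millis` is when the shrink index itself was created, and that the `2024.07.19` in the index name is the date the backing index was created.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_millis(ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


# Values copied from the explain output above.
lifecycle_date = from_millis(1723983321520)  # lifecycle_date_millis
warm_entered = from_millis(1724588721509)    # phase_time_millis
shrink_created = from_millis(1724590269676)  # index_creation_date_millis

# min_age values from the policy; both are counted from lifecycle_date.
warm_due = lifecycle_date + timedelta(days=7)
delete_due = lifecycle_date + timedelta(days=30)

# Assumption: the backing index was created on the date in its name.
backing_created = date(2024, 7, 19)
oldest_data_age_at_delete = (delete_due.date() - backing_created).days

print("lifecycle date :", lifecycle_date.isoformat(timespec="seconds"))
print("warm entered   :", warm_entered.isoformat(timespec="seconds"))
print("shrink created :", shrink_created.isoformat(timespec="seconds"))
print("delete due     :", delete_due.isoformat(timespec="seconds"))
print("oldest data at delete (days):", oldest_data_age_at_delete)
```

Read this way, the index entered warm about 7 days after its lifecycle date (2024-08-18) and would be due for deletion around 2024-09-17, at which point its oldest data would be roughly 60 days old.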

Hm, yes, I'm beginning to realize that it might be difficult to estimate when rolled-over warm indices are created, and again when they are moved on to the delete phase... even though the initial/first warm phase should start 7 days after the first index is created. But then, is it the full index that is shrunk to 1 primary shard, not just the data timestamped 7+ days ago?

Seems I might have to read up on ILM again :slight_smile:

As long as you're not ingesting late logs, there should be a fairly good correlation, yes :slight_smile: I can say that in my experience, the only issues have been related to timezone problems and large volumes of logs; otherwise they are always fairly closely correlated.

I've always found that the easiest way to test my understanding of ILM is to set up a test policy with short times, for example hot for 1 hour, warm for 1 hour, and delete after 1 hour, because then you can test your policies under various conditions and check that your understanding is right.
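A minimal sketch of such a test policy, built as a Python dict so it can be printed and pasted as a request body. The policy name `ilm-test-1h` and the exact durations are made up for illustration; the shape mirrors the 30-days-default policy from the first post.

```python
import json

# Hypothetical test policy: same shape as 30-days-default, hours instead of days.
test_policy = {
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {"rollover": {"max_age": "1h"}},
            },
            "warm": {
                "min_age": "1h",  # counted from rollover
                "actions": {
                    "forcemerge": {"max_num_segments": 1},
                    "shrink": {"number_of_shards": 1},
                },
            },
            "delete": {
                "min_age": "2h",  # also counted from rollover: 1h after warm starts
                "actions": {"delete": {}},
            },
        }
    }
}

# Request body for: PUT _ilm/policy/ilm-test-1h  (policy name is made up)
print(json.dumps(test_policy, indent=2))
```

Note that ILM only checks conditions every `indices.lifecycle.poll_interval` (10 minutes by default), so with timings this short the transitions will lag a little unless that cluster setting is lowered on the test cluster.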

And yes, ILM always works on full/rolled-over indices; it does not work on a subset of the data in one index. Thus, for an ILM cycle to continue, the index must first be rolled over, either due to size or age, and then the next phase of the cycle can commence. All data in the index is moved together.