We have a problem with ILM: it does not seem to be working as intended. We added a cold node to the cluster, and I set the logs policy to the default rollover thresholds (50 GB or 30 days), since our hot instance can handle that. After 30 days, indices should be moved to cold.
{
  "logs" : {
    "version" : 25,
    "modified_date" : "2022-06-16T09:30:36.061Z",
    "policy" : {
      "phases" : {
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_primary_shard_size" : "50gb",
              "max_age" : "30d"
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "cold" : {
          "min_age" : "30d",
          "actions" : {
            "allocate" : {
              "number_of_replicas" : 0,
              "include" : { },
              "exclude" : { },
              "require" : { }
            },
            "set_priority" : {
              "priority" : 50
            }
          }
        }
      }
    },
ILM started to work and put all the old indices into the migrate action, but they have been stuck there for some time. Judging by disk I/O, the cluster does not seem to be continuously writing them to the cold node.
Action status
[.ds-logs-endpoint.events.network-default-2022.04.12-000012] lifecycle action [migrate] waiting for [1] shards to be moved to the [data_cold] tier (tier migration preference configuration is [data_cold, data_warm, data_hot])
{
  "indices" : {
    ".ds-logs-endpoint.events.network-default-2022.04.12-000012" : {
      "index" : ".ds-logs-endpoint.events.network-default-2022.04.12-000012",
      "managed" : true,
      "policy" : "logs",
      "index_creation_date_millis" : 1649738075698,
      "time_since_index_creation" : "65.25d",
      "lifecycle_date_millis" : 1649789075719,
      "age" : "64.66d",
      "phase" : "cold",
      "phase_time_millis" : 1655284574757,
      "action" : "migrate",
      "action_time_millis" : 1655284575359,
      "step" : "check-migration",
      "step_time_millis" : 1655284576160,
      "step_info" : {
        "message" : "[.ds-logs-endpoint.events.network-default-2022.04.12-000012] lifecycle action [migrate] waiting for [1] shards to be moved to the [data_cold] tier (tier migration preference configuration is [data_cold, data_warm, data_hot])",
        "shards_left_to_allocate" : 1,
        "all_shards_active" : true,
        "number_of_replicas" : 0
      },
      "phase_execution" : {
        "policy" : "logs",
        "phase_definition" : {
          "min_age" : "30d",
          "actions" : {
            "set_priority" : {
              "priority" : 50
            }
          }
        },
        "version" : 24,
        "modified_date_in_millis" : 1655107428064
      }
    }
  }
}
So we can't tell whether it is slowly copying data over or is stuck somewhere. We have 2 hot nodes, and one of them is in the same location as the cold node, so copying between those two should be fast.
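One rough way to tell "slow" from "stuck" is to work out how long the index has been sitting in its current step. This is a minimal sketch, assuming the explain output above is available as a dict; the helper name `time_in_step` is hypothetical, and the reference time `now` is simply set to the policy's `modified_date` for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def time_in_step(explain_entry, now):
    """Return (step start as a UTC datetime, time elapsed since then)."""
    started = EPOCH + timedelta(milliseconds=explain_entry["step_time_millis"])
    return started, now - started

# Values copied from the explain output above.
entry = {
    "step": "check-migration",
    "step_time_millis": 1655284576160,
    "step_info": {"shards_left_to_allocate": 1},
}

# Assumed reference time: the policy's modified_date, 2022-06-16T09:30:36Z.
now = datetime(2022, 6, 16, 9, 30, 36, tzinfo=timezone.utc)
started, elapsed = time_in_step(entry, now)
print(f"{entry['step']} since {started.isoformat()}, elapsed {elapsed}")
```

If the elapsed time keeps growing while `shards_left_to_allocate` stays at 1, that would suggest the shard is not moving at all rather than moving slowly. In that case the cluster allocation explain API (`GET _cluster/allocation/explain`) is probably the next place to look.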
There are also indices in the check-rollover-ready step that have not reached 30 days or 50 GB, so I'm also wondering what ILM is trying to do with them and why the action is rollover.
{
  "indices" : {
    ".ds-logs-nginx.error-default-2022.05.31-000003" : {
      "index" : ".ds-logs-nginx.error-default-2022.05.31-000003",
      "managed" : true,
      "policy" : "logs",
      "index_creation_date_millis" : 1653999275177,
      "time_since_index_creation" : "15.93d",
      "lifecycle_date_millis" : 1653999275177,
      "age" : "15.93d",
      "phase" : "hot",
      "phase_time_millis" : 1654888740663,
      "action" : "rollover",
      "action_time_millis" : 1654888742464,
      "step" : "check-rollover-ready",
      "step_time_millis" : 1654888742464,
      "phase_execution" : {
        "policy" : "logs",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "set_priority" : {
              "priority" : 100
            },
            "rollover" : {
              "max_primary_shard_size" : "50gb",
              "max_age" : "30d"
            }
          }
        },
        "version" : 25,
        "modified_date_in_millis" : 1655371836061
      }
    }
  }
}
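For comparison, here is a small sketch of how the rollover conditions in the policy line up with this index. It assumes that check-rollover-ready is the step in which ILM waits and periodically re-checks the conditions. The helpers `parse_days` and `rollover_due` are hypothetical, and the shard size is a made-up placeholder, since the explain output does not report it.

```python
def parse_days(value):
    """Parse an age such as '15.93d' or '30d' into a number of days."""
    if not value.endswith("d"):
        raise ValueError(f"only day units are handled in this sketch: {value!r}")
    return float(value[:-1])

def rollover_due(age, primary_shard_gb, max_age="30d", max_primary_shard_gb=50):
    """Return True if at least one rollover condition from the policy is met."""
    return (
        parse_days(age) >= parse_days(max_age)
        or primary_shard_gb >= max_primary_shard_gb
    )

# "age" comes from the explain output above; 5 GB is an assumed shard size.
print(rollover_due("15.93d", primary_shard_gb=5))  # False
# A hypothetical older index would meet the max_age condition.
print(rollover_due("30.5d", primary_shard_gb=5))   # True
```

Under that assumption, an index sitting in check-rollover-ready at 15.93d would simply mean that neither condition has been met yet.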