Hi All,
I recently noticed high CPU usage on my hot data nodes. When looking at the hot threads (GET /_nodes/<node_id>/hot_threads), I noticed that the main cause was force merging during rollover from hot -> warm.
Example thread:
::: {<node_id>}{<snipped>}{<snipped>}{10.42.0.201}{10.42.0.201:9300}{hs}{k8s_node_name=<snipped>, xpack.installed=true, zone=rack1, transform.node=false}
Hot threads at 2022-01-28T17:52:23.277Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
100.0% [cpu=38.4%, other=61.6%] (500ms out of 500ms) cpu usage by thread 'elasticsearch[<node_id>][[filebeat-7.16.3-2022.01.28-000003][0]: Lucene Merge Thread #734]'
What confuses me, and something that I haven't been able to find in the docs, is why a warm action of my ILM policy (force merge) is being executed on a hot node. I'd expect warm actions to be executed on warm nodes, not hot nodes.
Would anyone have an explanation for why this is, or be able to point me to the docs that explain it?
ILM Policy in question:
PUT _ilm/policy/filebeat
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "readonly": {},
          "rollover": {
            "max_age": "30d",
            "max_primary_shard_size": "50gb"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "5d",
        "actions": {
          "forcemerge": {
            "max_num_segments": 1,
            "index_codec": "best_compression"
          },
          "readonly": {},
          "set_priority": {
            "priority": 50
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "readonly": {},
          "searchable_snapshot": {
            "snapshot_repository": "es-prod-snapshots",
            "force_merge_index": true
          },
          "set_priority": {
            "priority": 0
          },
          "allocate": {
            "number_of_replicas": 0
          }
        }
      },
      "delete": {
        "min_age": "500d",
        "actions": {
          "delete": {
            "delete_searchable_snapshot": true
          },
          "wait_for_snapshot": {
            "policy": "snap_all"
          }
        }
      }
    }
  }
}
Output of GET filebeat-7.16.3-2022.01.28-000003/_ilm/explain
{
  "indices" : {
    "filebeat-7.16.3-2022.01.28-000003" : {
      "index" : "filebeat-7.16.3-2022.01.28-000003",
      "managed" : true,
      "policy" : "filebeat",
      "lifecycle_date_millis" : 1643389546068,
      "age" : "55.64m",
      "phase" : "hot",
      "phase_time_millis" : 1643389548826,
      "action" : "rollover",
      "action_time_millis" : 1643389551427,
      "step" : "check-rollover-ready",
      "step_time_millis" : 1643389551427,
      "phase_execution" : {
        "policy" : "filebeat",
        "phase_definition" : {
          "min_age" : "0ms",
          "actions" : {
            "readonly" : { },
            "rollover" : {
              "max_primary_shard_size" : "50gb",
              "max_age" : "30d"
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "version" : 11,
        "modified_date_in_millis" : 1630510914564
      }
    }
  }
}
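To double-check which phase, action and step ILM reports for an index, the explain response above can be summarized programmatically. Below is a minimal sketch in Python that works on an already-parsed response body. The helper name summarize_ilm_explain is hypothetical (not part of any Elasticsearch client), and it only assumes the fields visible in the output above.

```python
import json


def summarize_ilm_explain(response):
    """Return {index_name: (phase, action, step)} for each managed index.

    `response` is the parsed JSON body of GET <index>/_ilm/explain.
    Hypothetical helper for illustration only; it reads just the fields
    that appear in the explain output shown above.
    """
    summary = {}
    for name, info in response.get("indices", {}).items():
        if not info.get("managed", False):
            continue  # skip indices that ILM does not manage
        summary[name] = (info.get("phase"), info.get("action"), info.get("step"))
    return summary


# Trimmed copy of the explain output above.
sample = json.loads("""
{
  "indices": {
    "filebeat-7.16.3-2022.01.28-000003": {
      "index": "filebeat-7.16.3-2022.01.28-000003",
      "managed": true,
      "policy": "filebeat",
      "age": "55.64m",
      "phase": "hot",
      "action": "rollover",
      "step": "check-rollover-ready"
    }
  }
}
""")

for index, (phase, action, step) in summarize_ilm_explain(sample).items():
    print(f"{index}: phase={phase} action={action} step={step}")
```

For the sample above, this reports the index with phase=hot, action=rollover and step=check-rollover-ready, matching what the explain call returned.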