So I've recently migrated our observability cluster from Elastic 7.9 to Elastic 7.10, and noticed some interesting gotchas that are worthy of discussion. I don't think these are bugs per se, but there is definitely some behaviour here that warrants further investigation, or at the very least documentation.
To give some background on our pipeline: we run Beats, which write to Logstash, which then writes to an index that is set up with a custom ILM policy to do rollover (as was the way before Ingest pipelines), all sat on ECK version 1.2.1.
Now our ECK architecture uses a custom LB object which only selects the hot nodes from our cluster, since that's where we want all the data written to. We ensure this behaviour is seen on the Elastic side by setting up our index templates with
"index.routing.allocation.require.data": "hot"
where `data` is an attribute of the hot nodes we want the data written to. This is your bog-standard way of setting up ILM, or at least it was before ILM roles were introduced.
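For illustration, a legacy template of this shape might look something like the sketch below. The template name, policy name and rollover alias are made up for the example; only the `auditbeat-7-*` pattern and the `require.data` setting reflect our actual setup.

```console
# Hypothetical legacy index template (names are illustrative only)
PUT _template/auditbeat-7
{
  "index_patterns": ["auditbeat-7-*"],
  "settings": {
    "index.lifecycle.name": "auditbeat-policy",
    "index.lifecycle.rollover_alias": "auditbeat-7",
    "index.routing.allocation.require.data": "hot"
  }
}
```

Every index created from this template is pinned to nodes whose `data` attribute is `hot` until something changes that setting.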
Now the reason this works is that when ILM moves the index through its stages, it will update this field to `warm` and then to `cold`, all attributes of our nodes. With roles based ILM, this is subtly changed by making use of the `_tier_preference` routing setting (`index.routing.allocation.include._tier_preference`) to dictate routing. This has the added benefit of allowing shards to be allocated to the tier above it should that tier run out of space (we'll touch on this again in a moment).
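To make the attribute-based flow concrete, here is a rough sketch of what such a policy could look like. The policy name and all the ages are assumptions for the example, not our real values; the `allocate` actions are what rewrite `require.data` as the index ages.

```console
# Hypothetical attribute-based ILM policy (name and ages are illustrative)
PUT _ilm/policy/auditbeat-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "1d",
        "actions": {
          "allocate": { "require": { "data": "warm" } }
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "allocate": { "require": { "data": "cold" } }
        }
      }
    }
  }
}
```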
So then, our migration day comes: we change our node attributes and config to remove the legacy way of adding roles and add the new relevant data tiers. However, for posterity we kept the existing node attributes (hot, warm, cold etc).
Cluster migrated, all goes okay, however the hot nodes are just filling up non-stop. Indexes are not being moved onto the warm nodes. After finally finding a lunchtime to look at why, we discover that indexes are correctly being first written to the hot nodes (thanks to the `data_content` role being on the hot nodes). The index then rolls over into the warm phase and sets its `_tier_preference` to `data_warm,data_hot`, but then cannot actually move to those nodes, because our index template (which we did not change) had continued to put `require.data: hot` on each of the indexes, and the ILM policies were no longer changing it.
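If you want to check whether you are in the same situation, requests along these lines should show it. This is a sketch: the index name `auditbeat-7-000001` is a hypothetical example, so substitute one of your own stuck indexes.

```console
# Show every allocation-related setting on the indexes, including
# require.data and include._tier_preference side by side
GET auditbeat-7-*/_settings/index.routing.allocation.*?flat_settings=true

# Show which ILM phase, action and step each index is currently in
GET auditbeat-7-*/_ilm/explain

# Ask the cluster why a given shard is (or is not) allowed to move
GET _cluster/allocation/explain
{
  "index": "auditbeat-7-000001",
  "shard": 0,
  "primary": true
}
```

An index that carries both `index.routing.allocation.require.data: hot` and a tier preference starting with `data_warm` is asking for two things that no single node of ours could satisfy.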
Now I get why this happens; there exist use cases where you want to route indexes according to arbitrary node attributes, and migrating to roles based ILM should not interfere with that. That said, I doubt I'm the only person with this setup on Elastic. I'm sure there is a sufficiently interesting technical solution, such as interrogating which attributes are in use in ILM and adjusting behaviour accordingly, but a simpler solution would be to just document this behaviour somewhere as a gotcha to watch out for (I tried to do this myself but I couldn't find a better place to put it than Twitter).
The other interesting thing I saw was that when getting the ILM policies in question, they did not include the `migrate` action like I would expect. The docs say that this is enabled by default and does not need explicitly adding unless you have other manual routing requirements in there as well (which of course I did at the time).
Furthermore, modifications to the ILM policies were in fact made, but only by half. Moving from `hot` -> `warm` nodes was configured to use roles, but moving from `warm` -> `cold` nodes was still configured to use attribute-based routing. I'm not sure what drove that behaviour, since the cluster had nodes of all 3 roles.
I'd also highly recommend not moving to roles based ILM and upgrading to 7.10 in one go (ECK makes this very tempting but alas, do not).
Actions I took to fix this:
- Removing the `require.data: hot` setting from all index templates
- Manually combing through indexes which still had this setting and removing it:
    PUT auditbeat-7-*/_settings
    {
      "index": {
        "routing": {
          "allocation": {
            "require": {
              "data": null
            }
          }
        }
      }
    }
- Using the API (not the UI) to fix the ILM policies, including the relevant `migrate {}` action for all policies
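As a sketch of that last step, a policy with explicit `migrate` actions could look like the request below. The policy name and ages are the same made-up values as before, not our production settings. Bear in mind that `PUT _ilm/policy` replaces the whole policy, so any other actions you rely on need to be included in the body too.

```console
# Hypothetical role-based ILM policy (name and ages are illustrative)
PUT _ilm/policy/auditbeat-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_age": "1d" }
        }
      },
      "warm": {
        "min_age": "1d",
        "actions": {
          "migrate": {}
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "migrate": {}
        }
      }
    }
  }
}

# Afterwards, confirm what the stored policy actually contains
GET _ilm/policy/auditbeat-policy
```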
I'm sure this was a bit of a corner case in terms of moving to roles based ILM, but I'd also imagine I'm not the only one who has a setup like this.