That all makes sense, I think. This is certainly not a path we'd like to go down, and especially not lightly; we're weighing up our options. For some extra context on where we've had problems:
- In terms of cross-version compatibility: we're aware of this, and going from 2 to 6 we struggled particularly with the removal of the string field type and of parent-child relationships. We could have gone through 5.6, but I think we'd still be reluctant to do an in-place upgrade (it would have gone awfully) because:
- it's sometimes quite hard for us to disentangle an ES release from a release of our client, because in a couple of cases our developers have unwittingly exposed subtle parts of the ES API into our own API (stuff like whether NaNs index without failing, or the ES field count limit) which has sometimes meant that we've learned about an ES break when our own API has broken in ways we never meant to expose (Hyrum's Law in action). When you hit this, not having an exit strategy is rather painful. It's generally ok if it's only our code that needs to be fixed, but if we ship a change where we need to notify our customers and have them change their configurations, that obviously leaves a sour taste.
- This is obviously epic tech debt, which we've now paid down, but we found a couple of places where engineers on one team had accessed the index of another team. Inevitably some of them did not add support for the new ES version, we flipped onto the new version, and everything broke for them.
- We've occasionally (I can think of twice) had issues with performance regressions. As one example, we had one ES upgrade where after a few days performance completely died, but predictably only on our largest and most important cluster (https://github.com/elastic/elasticsearch/pull/56708). In the end, we tracked it down, but it took a while, and in the meantime we'd nuked the cluster and started over. It's a scary prospect to be considering fully reindexing dozens of terabytes in order to restore service (thankfully this was during the ES2-6 migration and so we had a hot ES2 cluster to fall back on). I think in the other case we just ate degraded performance for a month on one of our clusters, but I don't remember the details.
Because of this, in general our strategy's been 'real life is messy, bring a shovel'. From ES2 to ES6 we ran two clusters in parallel, which was expensive from a coordination and dev perspective but was the only viable approach at the time given our environment (maybe we could have gone through ES5, but that'd be a different kettle of fish which isn't obviously easier given the issues we saw in practice). It meant we could do things like have internal users run all their queries through ES6. We're looking at reducing the coordination burden while maintaining our ability to roll back (which we found to be a development accelerant).
In terms of some changes literally just not being forwards compatible (e.g. adding a new data type), our process is that we don't use new features until we know we won't want to roll back (we gate the ES APIs our devs can use). Ideally, if you don't use a feature in the new format, you have a viable rollback path. But clearly this might not work for every upgrade, and certainly not every way of writing an upgrade. We also use a fairly sparse set of ES features (no scripts, reindexes, SQL, etc.).
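To make the gating idea concrete, here's a minimal sketch of the kind of check I mean. It's illustrative only: `FeatureGate`, the feature names and the version numbers are all made up for the example, and it isn't our actual tooling or anything that exists in ES. The assumption is that we keep a hand-maintained map from each feature we might use to the ES version that introduced it, and only allow features that already exist in the oldest version we still want to be able to roll back to.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Version = Tuple[int, int]  # (major, minor), e.g. (6, 1)


@dataclass(frozen=True)
class FeatureGate:
    """Hypothetical allowlist: a feature is usable only if it already
    exists in the oldest ES version we still want to roll back to."""

    rollback_floor: Version
    introduced_in: Dict[str, Version]  # hand-maintained, placeholder data

    def is_allowed(self, feature: str) -> bool:
        introduced = self.introduced_in.get(feature)
        if introduced is None:
            # Unknown features are denied by default, so nobody picks up
            # a new API by accident.
            return False
        # Tuples compare element by element, so (6, 0) <= (6, 1) holds.
        return introduced <= self.rollback_floor


# Placeholder feature names and versions, not real ES release facts.
gate = FeatureGate(
    rollback_floor=(6, 0),
    introduced_in={
        "existing_field_type": (2, 0),
        "new_field_type": (6, 1),
    },
)

allowed_now = [f for f in gate.introduced_in if gate.is_allowed(f)]
```

Once we decide we won't roll back past a given version, we'd raise `rollback_floor` and the newer features open up without any other change.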
In terms of the safety things, yeah, those make a lot of sense. Right now we're mostly investigating what the work would look like, probably by looking at what it would have taken to get from 6.0 to 6.1 or from 6.8 to 7.0 (the changes for both of those are beefier than 6.7 to 6.8).
Hopefully that's useful for you and explains a bit more about why we're considering a path like this (and to be clear, it's very much a science-project, "can we make our lives easier?" kind of thing).
But, to make sure I understand... If we were to go down this path, am I right in thinking:
- Generally what we would care about are PRs that add new indexing features, of which there are typically many around a new release. But really what we'd care about are changes to classes that interact with Lucene, or the metadata files, and specifically, not new features, but pre-existing features that are changed to behave differently (and in an incompatible way).
- Are there things I'm missing with the above?
- Or Lucene upgrades.
- Upgrades that change the cluster state storage would also be a part of this. Those seem like they'd be much harder to work around, given you'd effectively have to implement a backwards migration between e.g. ES6 and ES7, but at least you can update master and data nodes independently.
- If the translog format is changed, we would also care about this, but that should change less frequently and so can be more of a sanity check.
My expectation is that this is probably pretty viable for later minor versions (e.g. 6.6 vs 6.5 looks fairly tractable) but it looks much trickier for major versions, and that's where the wins are.
Thank you for your replies!