Why stop ML during a rolling upgrade?



The docs for rolling upgrades say "Stop any machine learning jobs that are running" without explaining why. Could somebody please explain why this step is required?

(David Roberts) #2

It's not a hard requirement in recent versions, but we consider it good practice if it's not too much trouble as it:

  1. Reduces load on the cluster
  2. Ensures all your jobs persist their latest model state immediately before the upgrade and restore it once the upgrade is complete

If the version you are upgrading from is 5.4, 6.1 or 6.2 then you must close jobs before upgrading, or you will suffer bad side effects due to bugs in these versions. (If you fail to close your jobs before upgrading from 5.4 then you will end up with incorrect index mappings that prevent you from using ML in 6.5, and if you fail to close your jobs before upgrading from 6.1 or 6.2 then the jobs will probably have to re-learn from scratch in the version you upgrade to.)

But for other versions it's not absolutely essential to close your jobs before upgrading. We're in the process of updating the docs for this area. If you want a sneak preview see https://github.com/elastic/elasticsearch/pull/38876/files
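
As a rough illustration of the rule above, here is a minimal Python sketch that decides whether closing jobs is mandatory for a given source version and lists the requests that would stop all datafeeds and close all jobs. The function names and the `prefix` parameter are made up for this example; it only builds the request paths and does not talk to a cluster. The `_xpack/ml` prefix for 5.x and 6.x is an assumption you should check against the docs for your version.

```python
from typing import List, Tuple

# Versions the answer above names as requiring closed jobs before upgrading.
MUST_CLOSE_MINORS = {(5, 4), (6, 1), (6, 2)}


def must_close_jobs(version: str) -> bool:
    """Return True if `version` (e.g. "6.2.4") is a 5.4, 6.1 or 6.2 release."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) in MUST_CLOSE_MINORS


def close_requests(prefix: str = "_ml") -> List[Tuple[str, str]]:
    """Build the requests that stop all datafeeds, then close all jobs.

    `prefix` is an assumption to adjust per version: 5.x and 6.x are
    assumed here to expose the ML APIs under "_xpack/ml".
    """
    return [
        ("POST", f"{prefix}/datafeeds/_all/_stop"),
        ("POST", f"{prefix}/anomaly_detectors/_all/_close"),
    ]


if __name__ == "__main__":
    for version in ("5.4.3", "6.2.4", "6.5.0"):
        verdict = "must close" if must_close_jobs(version) else "optional"
        print(version, verdict)
    for method, path in close_requests("_xpack/ml"):
        print(method, path)
```

Datafeeds are stopped first so that no new data arrives while the jobs are closing and persisting their model state.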


Should ML jobs be stopped (or upgrade mode enabled with _ml/set_upgrade_mode?enabled=true in future ES versions) during the whole upgrade process? We upgrade in phases: master nodes, data nodes, ML nodes, coordinating nodes. The whole process usually takes about an hour, which is a long time to have ML disabled. Enabling ML upgrade mode only while the ML nodes are being upgraded would be fine, though. Is it sufficient to enable this mode during that phase only?

(David Roberts) #4

It's only during reindexing of ML indices created in 5.x prior to upgrading to 7.x that you really must enable upgrade mode.

Otherwise upgrading with ML jobs enabled will just cause them to shift between nodes as you do your rolling upgrade. Potentially this could happen multiple times, which makes the cluster do extra work. But if you prefer this to stopping the jobs then that's fine - leave them running during your upgrade.
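
If you do decide to use upgrade mode for only part of a phased upgrade, the ordering could be sketched roughly like this in Python. The `plan_upgrade` helper and the phase names are hypothetical, and the sketch only produces an ordered list of steps; it does not call the cluster. Note that the answer above does not say toggling around one phase is required, only that upgrade mode is strictly needed while reindexing ML indices created in 5.x.

```python
from typing import List

# Endpoint as quoted in the question above; the enabled=false form is its inverse.
UPGRADE_MODE_ON = "POST _ml/set_upgrade_mode?enabled=true"
UPGRADE_MODE_OFF = "POST _ml/set_upgrade_mode?enabled=false"


def plan_upgrade(phases: List[str], ml_phase: str) -> List[str]:
    """Return ordered steps with upgrade mode toggled only around `ml_phase`."""
    steps: List[str] = []
    for phase in phases:
        if phase == ml_phase:
            steps.append(UPGRADE_MODE_ON)
        steps.append(f"upgrade {phase}")
        if phase == ml_phase:
            steps.append(UPGRADE_MODE_OFF)
    return steps


if __name__ == "__main__":
    phases = ["master nodes", "data nodes", "ML nodes", "coordinating nodes"]
    for step in plan_upgrade(phases, ml_phase="ML nodes"):
        print(step)
```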

Each time a job relocates to a different node it will restore the last model state it persisted. This will be 0-4 hours old, so the model will no longer have knowledge of any major changes that occurred in the last 0-4 hours. Avoiding this is one further benefit of closing jobs before upgrade, but in most cases will not make much difference to results. (The case where it makes a huge difference is if you're upgrading from 6.1 or 6.2, which had a bug with periodic model persistence.)
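
To make the 0-4 hour window concrete, here is a small Python sketch of the arithmetic. The function names and timestamps are invented for the example, and the 4-hour bound is simply the upper end of the range quoted above, not a value read from any cluster setting.

```python
from datetime import datetime, timedelta

# Upper bound taken from the "0-4 hours" figure in the answer above.
MAX_PERSIST_GAP = timedelta(hours=4)


def knowledge_lost(last_persisted: datetime, relocated_at: datetime) -> timedelta:
    """Span of recent data the restored model has not seen."""
    if relocated_at < last_persisted:
        raise ValueError("relocation cannot precede the last persisted state")
    return relocated_at - last_persisted


def within_expected_gap(last_persisted: datetime, relocated_at: datetime) -> bool:
    """True if the lost span fits the 0-4 hour range described above."""
    return knowledge_lost(last_persisted, relocated_at) <= MAX_PERSIST_GAP


if __name__ == "__main__":
    persisted = datetime(2019, 2, 14, 8, 0)   # hypothetical last model snapshot
    relocated = datetime(2019, 2, 14, 10, 30)  # hypothetical node restart
    print(knowledge_lost(persisted, relocated))
    print(within_expected_gap(persisted, relocated))
```

Closing the job first makes this gap effectively zero, because the model state is persisted at close time.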