Why stop ML during a rolling upgrade?

The docs for the rolling upgrade say "Stop any machine learning jobs that are running" without any explanation of why. Could somebody please explain why this step is required?

It's not a hard requirement in recent versions, but we consider it good practice if it's not too much trouble, as it:

  1. Reduces load on the cluster
  2. Ensures all your jobs persist their latest model state immediately before the upgrade and restore it after the upgrade is complete (sketched below)
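
If it helps, here's a minimal sketch of what "closing everything" looks like via the REST API, assuming a recent cluster where the ML APIs live under the `_ml` prefix and the `_all` wildcard is accepted:

```
# Stop all datafeeds first; a job with a running datafeed cannot be closed normally
POST _ml/datafeeds/_all/_stop

# Then close all anomaly detection jobs, which persists their latest model state
POST _ml/anomaly_detectors/_all/_close
```

After the upgrade, re-opening the jobs and restarting their datafeeds picks up from that persisted state.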

If the version you are upgrading from is 5.4, 6.1 or 6.2, then you must close jobs before upgrading or you will suffer bad side effects due to bugs in those versions. (If you fail to close your jobs before upgrading from 5.4 you will end up with incorrect index mappings that prevent you from using ML in 6.5, and if you fail to close your jobs before upgrading from 6.1 or 6.2 the jobs will probably have to re-learn from scratch in the version you upgrade to.)
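
For those older versions the calls are the same, but, as far as I recall, under the `_xpack` prefix (`my-datafeed` and `my-job` are placeholder names here):

```
# 5.4-6.x style paths; the _xpack prefix was only dropped in later releases
POST _xpack/ml/datafeeds/my-datafeed/_stop
POST _xpack/ml/anomaly_detectors/my-job/_close
```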

But for other versions it's not absolutely essential to close your jobs before upgrading. We're in the process of updating the docs for this area. If you want a sneak preview see https://github.com/elastic/elasticsearch/pull/38876/files

Should ML jobs be stopped (or upgrade mode enabled via _ml/set_upgrade_mode?enabled=true in future ES versions) during the whole upgrade process? We are upgrading in phases - master nodes, data nodes, ML nodes, coordinating nodes. This process usually takes about 1 hour, which is a long time to have ML disabled. But setting ML upgrade mode just during the phase of upgrading the ML nodes would be fine. Is it sufficient to enable this mode only while upgrading the ML nodes?

It's only during the reindexing of ML indices that were created in 5.x (which has to happen before you upgrade to 7.x) that you really must enable upgrade mode.
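
To be concrete, that's just the call you mentioned, toggled on either side of the step that needs it (if I remember correctly the endpoint only exists from 6.7 onwards):

```
# Turn upgrade mode on before the step that needs it (e.g. reindexing ML indices)
POST _ml/set_upgrade_mode?enabled=true

# ... perform that step ...

# Turn it off again afterwards so jobs resume where they left off
POST _ml/set_upgrade_mode?enabled=false
```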

Otherwise upgrading with ML jobs enabled will just cause them to shift between nodes as you do your rolling upgrade. Potentially this could happen multiple times, which makes the cluster do extra work. But if you prefer this to stopping the jobs then that's fine - leave them running during your upgrade.
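
If you do leave them running and want to see where they end up, the job stats API reports the node each open job is currently assigned to; a quick sketch:

```
# The "node" section of each open job's stats shows its current assignment
GET _ml/anomaly_detectors/_all/_stats
```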

Each time a job relocates to a different node it will restore the last model state it persisted. That state will be 0-4 hours old, so the model will have no knowledge of any major changes that occurred in those last 0-4 hours. Avoiding this is one further benefit of closing jobs before the upgrade, but in most cases it will not make much difference to the results. (The case where it makes a huge difference is if you're upgrading from 6.1 or 6.2, which had a bug with periodic model persistence.)
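
If you want to check how old a job's latest persisted state actually is before you upgrade, the model snapshots API shows the timestamps (using `my-job` as a placeholder job id):

```
# Look at the "timestamp" of the most recent snapshot to see how old the persisted state is
GET _ml/anomaly_detectors/my-job/model_snapshots
```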
