Transform nicing

Hi guys,

We make use of transforms extensively in a production cluster (6 nodes with 20 CPUs and 32 GB RAM each, all nodes have all roles), with approximately 250 transforms running in continuous mode. Some transforms have a frequency of 15m and others 1h (these long intervals are enough for our use case).
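For context, our transforms are defined roughly like the sketch below, using Python's requests library against the transform REST API (the cluster URL, index, field, and transform names here are just placeholders, not our real configuration):

```python
import requests

ES = "http://localhost:9200"  # placeholder: adjust URL and auth for your cluster

# Minimal continuous pivot transform with an explicit 15m frequency.
# All index, field, and transform names below are made up for illustration.
body = {
    "source": {"index": "events-*"},
    "dest": {"index": "events-summary"},
    "frequency": "15m",  # how often the transform checks for new data
    "sync": {"time": {"field": "@timestamp", "delay": "60s"}},
    "pivot": {
        "group_by": {"user": {"terms": {"field": "user.id"}}},
        "aggregations": {"avg_duration": {"avg": {"field": "duration_ms"}}},
    },
}
requests.put(f"{ES}/_transform/events-summary-transform", json=body).raise_for_status()
requests.post(f"{ES}/_transform/events-summary-transform/_start").raise_for_status()
```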

We experience recurring load spikes at the same moment every hour. When checking the transform stats in Kibana during the spikes, I can see many transforms (up to 60) in the "indexing" state at the same time.
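The same picture can be pulled outside Kibana from the transform stats API; here is a minimal sketch in Python with requests (cluster URL and auth are placeholders):

```python
import requests

ES = "http://localhost:9200"  # placeholder: adjust URL and auth for your cluster

# List the transforms that are currently in the "indexing" state, i.e. the
# same figure visible in the Kibana transform stats during a load spike.
stats = requests.get(f"{ES}/_transform/_stats", params={"size": 1000}).json()
indexing = [t["id"] for t in stats["transforms"] if t.get("state") == "indexing"]
print(f"{len(indexing)} transforms indexing right now: {indexing}")
```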

While I fully understand how the periodic scheduler works, I think it would be nice to support a "nice" mode cluster setting for transforms, which would allow Elasticsearch to automatically rearrange transforms so that they don't all run at the same time. The workaround for now would be to stop/start transforms manually to restart the scheduler at another moment. This is cumbersome, and its effect is also completely lost on node restarts, because in that case all transforms are restarted at the same time!
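To illustrate the workaround, here is a rough sketch of re-phasing a group of transforms by stopping and restarting them a couple of minutes apart (Python with requests; the transform ids, cluster URL, and stagger interval are made up):

```python
import time

import requests

ES = "http://localhost:9200"  # placeholder: adjust URL and auth for your cluster
TRANSFORM_IDS = ["transform-a", "transform-b", "transform-c"]  # made-up ids
STAGGER_SECONDS = 120  # arbitrary offset between restarts

for transform_id in TRANSFORM_IDS:
    # Stopping and restarting re-anchors the transform's schedule to the
    # restart time, so staggered restarts spread the runs across the hour.
    requests.post(
        f"{ES}/_transform/{transform_id}/_stop",
        params={"wait_for_completion": "true"},
    ).raise_for_status()
    requests.post(f"{ES}/_transform/{transform_id}/_start").raise_for_status()
    time.sleep(STAGGER_SECONDS)
```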

What do you think of this idea? Is it worth an improvement issue on GitHub?

Do you have any insight about the issue we're experiencing?

Thanks,
David

Hi,
Thanks for bringing this up!

> The workaround for now would be to stop/start transforms manually to restart the scheduler at another moment.

There is another (simpler) workaround, provided you're on version 8.7.0 or later:
You can use the _schedule_now API (see the docs). If you call it, the transform processes new data immediately and its next scheduled run becomes now + frequency.
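For example, spreading the transforms out with _schedule_now could look roughly like this (a sketch in Python with requests; the transform ids, cluster URL, and stagger interval are placeholders):

```python
import time

import requests

ES = "http://localhost:9200"  # placeholder: adjust URL and auth for your cluster
TRANSFORM_IDS = ["transform-a", "transform-b", "transform-c"]  # made-up ids
STAGGER_SECONDS = 120  # arbitrary offset between transforms

for transform_id in TRANSFORM_IDS:
    # _schedule_now (8.7.0+) makes the transform process data immediately and
    # moves its next scheduled run to "now + frequency", so calling it a few
    # minutes apart spreads the hourly transforms across the hour.
    requests.post(f"{ES}/_transform/{transform_id}/_schedule_now").raise_for_status()
    time.sleep(STAGGER_SECONDS)
```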

> and its effect is also completely lost on node restarts, because in that case all transforms are restarted at the same time

Yes, that's still true even with the workaround I described above.

> What do you think of this idea? Is it worth an improvement issue on GitHub?

Sure, feel free to create a GitHub issue. You described the problem well in this post, so you can just copy-paste the text into the issue if you like.

> Do you have any insight about the issue we're experiencing?

We have been aware of this issue for some time but have not yet gotten around to fixing it. Unfortunately, I cannot provide any guarantees about if or when it will be properly fixed.

Thanks for the quick reply. We're currently on v8.5.1, but I'm looking into upgrading soon.
I'll create the GitHub issue.

Cheers,
David
