Deployment down because of an ML job

Hello everyone,
Recently, I had an issue with my deployment. I created an outlier detection job and it ran out of memory. However, before I could stop or reconfigure it, the deployment went down.

I already tried to restore a previous snapshot, but Kibana is also down, so I cannot access "Snapshot & restore", and requests from the API console return nothing but errors.
I've also tried to temporarily increase the RAM, but the change failed.

I contacted the support team, but I've only received an automatic response saying it will be looked into within 3 business days. I can't afford to wait that long, so if anyone has a suggestion, I'll gladly accept it!

Thank you for your understanding.

Hi @vmesis Welcome to the community.

Did you open the ticket as a Sev 1? I suspect you opened it as a Sev 3.

I would close that ticket and open a new Sev 1 ticket with Cluster Down / Red in the subject.

I would not try to go straight to Snapshot and Restore; your cluster needs to be healthy before you can do that.

We are not support here, so we cannot really help with underlying deployment issues.

Hi @stephenb, thanks for your reply.
Indeed, I opened it as a Sev 3, but not on purpose. I tried to open a case directly from the support page, but the technical support button is greyed out and disabled, even though I have the subscription for this.
So instead, I sent an email directly to support, and it was put in Sev 3 automatically.

Perhaps someone else in your organization can use the support portal to properly open a ticket.

You may not be added as a support contact.

I think you are getting a little help now...
